Abstract. Several recent studies in privacy-preserving learning have considered the trade-off between
utility or risk and the level of differential privacy guaranteed by mechanisms for statistical
query processing. In this paper we study this trade-off in private Support Vector Machine (SVM)
learning. We present two efficient mechanisms, one for the case of finite-dimensional feature mappings
and one for potentially infinite-dimensional feature mappings with translation-invariant kernels.
For the case of translation-invariant kernels, the proposed mechanism minimizes regularized
empirical risk in a random Reproducing Kernel Hilbert Space whose kernel uniformly approximates
the desired kernel with high probability. This technique, borrowed from large-scale learning, allows
the mechanism to respond with a finite encoding of the classifier, even when the function class is of
infinite VC dimension. Differential privacy is established using a proof technique from algorithmic
stability. Utility, meaning that the mechanism's response function is pointwise $\epsilon$-close to
non-private SVM with probability $1 - \delta$, is proven by appealing to the smoothness of regularized
empirical risk minimization with respect to small perturbations to the feature mapping. We conclude
with a lower bound on the optimal differential privacy of the SVM. This negative result states that
for any $\delta$, no mechanism can be simultaneously $(\epsilon, \delta)$-useful and
$\beta$-differentially private for small $\epsilon$ and small $\beta$.



1. Introduction 

The goal of a well-designed statistical database is to provide aggregate information about a database's 
entries while maintaining individual entries' privacy. These two goals of utility and privacy are inherently 
discordant. For a mechanism to be useful, its responses must closely resemble some target statistic of the 
database's entries. However to protect privacy, it is often necessary for the mechanism's response distribution 
to be 'smoothed out', i.e., the mechanism must be randomized to reduce the individual entries' influence on 
this distribution. It has been of key interest to the statistical database community to understand when the
goals of utility and privacy can be efficiently achieved simultaneously (Dinur and Nissim 2003; Barak et al.
2007; Dwork et al. 2007; Blum et al. 2008; Chaudhuri and Monteleoni 2009; Kasiviswanathan et al. 2008).



In this paper we consider the practical goal of private regularized empirical risk minimization (ERM) in
Reproducing Kernel Hilbert Spaces for the special case of the Support Vector Machine (SVM). We adopt the
strong notion of differential privacy as formalized by Dwork (2006). Our efficient new mechanisms are shown
to parametrize functions that are close to non-private SVM under the $L_\infty$-norm, with high probability. In
our setting this notion of utility is stronger than closeness of risk (cf. Remark 3).

We employ a number of algorithmic and proof techniques new to differential privacy. One of our new 
mechanisms borrows a technique from large-scale learning, in which regularized ERM is performed in a 
random feature space whose inner-product uniformly approximates the target feature space inner-product. 
This random feature space is constructed by viewing the target kernel as a probability measure in the Fourier 
domain. This technique enables the finite parametrization of responses from function classes with infinite 
VC dimension. To establish utility, we show that regularized ERM is relatively insensitive to perturbations 
of the kernel: not only does the technique of learning in a random RKHS enable finitely-encoded privacy- 
preserving responses, but these responses well-approximate the responses of non-private SVM. Together 
these two techniques may prove useful in extending privacy-preserving mechanisms to learn in large function 
spaces. To prove differential privacy, we borrow a proof technique from the area of algorithmic stability. We 
believe that stability may become a fruitful avenue for constructing new private mechanisms in the future, 
based on learning maps presently known to be stable. 
Of particular interest is the optimal differential privacy of the SVM, which loosely speaking is the best
level of privacy achievable by any accurate mechanism for SVM learning. Through our privacy-preserving 
mechanisms for the SVM, endowed with guarantees of utility, we upper bound optimal differential privacy. 
We also provide lower bounds on the SVM's optimal differential privacy, which are impossibility results for 
simultaneously achieving high levels of utility and privacy. 

The remainder of this paper is organized as follows. After concluding this section with a summary of
related work, we recall basic concepts of differential privacy and SVM learning in Section 2. Sections 3 and 4
describe the new mechanisms for private SVM learning for finite-dimensional feature maps and (potentially
infinite-dimensional) feature maps with translation-invariant kernels. Each mechanism is accompanied by
proofs of privacy and utility bounds. Section 5 considers the special case of hinge loss and presents an upper
bound on the SVM's optimal differential privacy. A corresponding lower bound is then given in Section 6.
We conclude the paper with several open problems.

1.1. Related Work. There is a rich literature of prior work on differential privacy in the theory community. 
The following sections summarize work related to our own, organized to contrast this work with our main 
contributions. 

1.1.1. Range Spaces Parametrizing Vector-Valued Statistics or Functions with Finite VC-dimension. Early
work on private interactive mechanisms focused on approximating real- and vector-valued statistics (e.g.,
Dinur and Nissim 2003; Blum et al. 2005; Dwork et al. 2006; Dwork 2006; Barak et al. 2007). McSherry
and Talwar (2007) first considered private mechanisms with range spaces parametrizing sets more general
than real-valued vectors, and used such differentially private mappings for mechanism design. More related to
our work are the private mechanisms for regularized logistic regression proposed and analyzed by Chaudhuri
and Monteleoni (2009). There the mechanism's range space parametrizes the VC-dimension $d + 1$ class of
linear hyperplanes in $\mathbb{R}^d$. Kasiviswanathan et al. (2008) showed that discretized concept classes can be PAC
learned or agnostically learned privately, albeit via an inefficient mechanism. Blum et al. (2008) showed
that non-interactive mechanisms can privately release anonymized data such that utility is guaranteed over
classes of predicate queries with polynomial VC dimension, when the domain is discretized. Dwork et al.
(2009) more recently characterized when utility and privacy can be achieved by efficient non-interactive
mechanisms. In this paper we consider efficient mechanisms for private SVM learning, whose range spaces
parametrize real-valued functions (whose signs form trained classifiers). One case covered by our analysis is
learning with a Gaussian kernel, which corresponds to learning over a class of infinite VC dimension.

1.1.2. Practical Privacy-Preserving Learning (Mostly) via Subset-Sums. Most prior work in differential
privacy has focused on the deep analysis of mechanisms for relatively simple statistics (with histograms and
contingency tables as explored by Blum et al. 2005 and Barak et al. 2007 respectively, as examples) and
learning algorithms (e.g., interval queries and half-spaces as explored by Blum et al. 2008), or on constructing
learning algorithms that can be decomposed into subset-sum operations (e.g., perceptron, k-NN, ID3 as
described by Blum et al. 2005, and various recommender systems due to the work of McSherry and Mironov
2009). By contrast, we consider the practical goal of SVM learning, which does not decompose into
subset-sums. It is also notable that our mechanisms run in polynomial time. The most related work to our own
in this regard is due to Chaudhuri and Monteleoni (2009), although their results hold only for differentiable
loss and finite feature mappings.

1.1.3. The Privacy-Utility Trade-Off. Like several prior studies, we consider the trade-off between privacy
and utility. Barak et al. (2007) presented a mechanism for releasing contingency tables that guarantees
differential privacy and also guarantees a notion of accuracy: with high probability all marginals from
the released table are close in $L_1$-norm to the true table's marginals. As mentioned above, Blum et al.
(2008) developed a private non-interactive mechanism that releases anonymized data such that all predicate
queries in a VC-class take on similar values on the anonymized data and original data. In the work of
Kasiviswanathan et al. (2008), utility corresponds to PAC learning: with high probability the response and
target concepts are close, averaged over the underlying measure.

A sequence of prior negative results has shown that any mechanism providing overly accurate responses
cannot be private (Dinur and Nissim 2003; Dwork et al. 2007; Dwork and Yekhanin 2008). Dinur and Nissim
(2003) showed that if noise of rate only $o(\sqrt{n})$ is added to subset sum queries on a database of bits
then an adversary can reconstruct a $1 - o(1)$ fraction of the database. This is a threshold phenomenon that
says if accuracy is too great, privacy cannot be guaranteed at all. This result was more recently extended
to allow for mechanisms that answer a small fraction of queries arbitrarily (Dwork et al. 2007). We show a
similar negative result for the private SVM setting: any mechanism that is too accurate with respect to the
SVM cannot guarantee strong levels of privacy.

1.1.4. Connections between Stability, Robust Statistics, and Global Sensitivity. To prove differential privacy,
we borrow a proof technique from the area of algorithmic stability. In passing, Kasiviswanathan et al. (2008)
note the similarity between notions of algorithmic stability and differential privacy, but do not exploit
it. The connection between algorithmic stability and differential privacy is qualitatively similar to the
recent work of Dwork and Lei (2009), who demonstrated that robust estimators can serve as the basis for
private mechanisms, by exploiting the limited influence of outliers on such estimators.

2. Background & Definitions 

A database $D$ is a sequence of $n > 1$ entries or rows $(x_i, y_i) \in \mathbb{R}^d \times \{-1, 1\}$, which are input point-label
pairs or examples. We say that a pair of databases $D_1, D_2$ are neighbors if they differ on one entry. A
mechanism $M$ is a service trusted with access to a database $D$, that releases aggregate information about
$D$ while maintaining privacy of individual entries. By $M(D)$ we mean the response of $M$ on $D$. We assume
that this is the only information released by the mechanism. Denote the range space of $M$ by $\mathcal{T}_M$. We adopt
the following strong notion of differential privacy due to Dwork (2006).

Definition 1. For any $\beta > 0$, a randomized mechanism $M$ provides $\beta$-differential privacy, if, for all
neighboring databases $D_1, D_2$ and all responses $t \in \mathcal{T}_M$,

$$\log\left(\frac{\Pr(M(D_1) = t)}{\Pr(M(D_2) = t)}\right) \le \beta .$$

The probability in the definition is over the randomization in $M$. For continuous $\mathcal{T}_M$ we mean by this
ratio a Radon-Nikodym derivative of the distribution of $M(D_1)$ with respect to the distribution of $M(D_2)$.
If an adversary knows $M$ and the first $n - 1$ entries of $D$, she may simulate the mechanism with different
choices for the missing example. If the mechanism's response distribution varies smoothly with her choice,
the adversary will not be able to infer the true value of entry $n$ by querying $M$. In the sequel we assume
WLOG that each pair of neighboring databases differ on their last entry.
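To make Definition 1 concrete, the following minimal sketch evaluates the log-density ratio for a scalar statistic released with additive Laplace noise. It is an illustration only: the function names, the scalar setting and the example values are assumptions, not part of the mechanisms studied in this paper.

```python
import math


def laplace_log_density(t, mean, scale):
    """Log-density at t of a Laplace distribution with the given mean and scale."""
    return -math.log(2.0 * scale) - abs(t - mean) / scale


def log_ratio(t, stat1, stat2, scale):
    """log of p(M(D1) = t) / p(M(D2) = t) when M(D) = stat(D) + Laplace(0, scale).

    stat1 and stat2 stand for the (hypothetical) statistic evaluated on two
    neighboring databases D1 and D2.
    """
    return laplace_log_density(t, stat1, scale) - laplace_log_density(t, stat2, scale)


def privacy_level(sensitivity, scale):
    """Worst-case bound on the log-ratio: beta = sensitivity / scale."""
    return sensitivity / scale
```

By the triangle inequality the log-ratio never exceeds `|stat1 - stat2| / scale` in absolute value, whatever response `t` is observed, which is the sense in which the response distribution "varies smoothly" with the missing entry.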

Intuitively the more an 'interesting' mechanism $M$ (see footnote 1) is perturbed to guarantee differential
privacy, the less like $M$ the resulting mechanism $\hat{M}$ will become. The next definition formalizes the
notion of 'likeness'.

Definition 2. Consider two mechanisms $\hat{M}$ and $M$ with the same domain and response spaces $\mathcal{T}_{\hat{M}}, \mathcal{T}_M$
respectively. Let $\mathcal{X}$ be some set and let $\mathcal{F}$ be a space of real-valued functions on $\mathcal{X}$ that is parametrized
by the response spaces: for every $t \in \mathcal{T}_{\hat{M}} \cup \mathcal{T}_M$ let $f_t \in \mathcal{F}$ be some function. Finally assume $\mathcal{F}$ is endowed
with norm $\|\cdot\|_{\mathcal{F}}$. Then for $\epsilon > 0$ and $0 < \delta < 1$ we say that $\hat{M}$ is $(\epsilon, \delta)$-useful with respect to $M$
(see footnote 2) if, for all databases $D$,

$$\Pr\left( \left\| f_{\hat{M}(D)} - f_{M(D)} \right\|_{\mathcal{F}} \le \epsilon \right) \ge 1 - \delta .$$

Typically $\hat{M}$ will be a privacy-preserving version of $M$ that has been perturbed somehow. Usefulness
means that not only does $\hat{M}$ guarantee privacy of the training database, but that the aggregate information
revealed about the database by $\hat{M}$ is 'close' to what would be revealed by the desired (but non-private)
mechanism $M$. In the sequel we will take $\|\cdot\|_{\mathcal{F}}$ to be the sup-norm over a subset $\mathcal{M} \subseteq \mathbb{R}^d$ containing
the data, which we denote by $\|f\|_{\infty;\mathcal{M}} = \sup_{x \in \mathcal{M}} |f(x)|$. It will also be convenient to use the notation
$\|k\|_{\infty;\mathcal{M}} = \sup_{x, y \in \mathcal{M}} |k(x, y)|$ for bivariate functions $k(\cdot, \cdot)$.

Remark 3. In the sequel we develop privacy-preserving mechanisms that are useful with respect to the
Support Vector Machine (see the next section for a brief introduction to the SVM). The SVM works to
minimize the expected hinge-loss (i.e., risk in terms of the hinge-loss), which is a convex surrogate for the
expected 0-1 loss. Since the hinge-loss is Lipschitz in the real-valued function output by the SVM, it follows
that a mechanism $\hat{M}$ having utility with respect to the SVM also has expected hinge-loss that is within
$\epsilon$ of the SVM's hinge-loss with high probability. That is, $(\epsilon, \delta)$-usefulness with respect to the sup-norm is
stronger than guaranteed closeness of risk (absolute bounds on risk for regularized logistic regression are
explored by Chaudhuri and Monteleoni 2009; Kasiviswanathan et al. 2008 consider the task of private PAC
learning, which demands closeness of risk). We consider the hinge-loss further in Sections 5 and 6. Until
then we work with arbitrary convex, Lipschitz losses.

Footnote 1. Examples of interesting properties include low risk, robustness to a small amount of malicious noise, etc.

Footnote 2. Note that we have chosen to overload the term $(\epsilon, \delta)$-usefulness introduced by Blum et al. (2008) for non-interactive
mechanisms that release anonymized data. Our definition of usefulness is analogous for the present setting of privacy-preserving
learning, where a single function is released.

Algorithm 1 SVM

Inputs: database $D = \{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$; kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$; convex loss
function $\ell$; parameter $C > 0$.

(1) $\alpha^\star \leftarrow$ Solve the QP dual of Primal (2.1) (see e.g., the derivations by Bishop 2006); and

(2) Return vector $\alpha^\star$.

We will see that the presented analysis does not simultaneously guarantee privacy at arbitrary levels and
utility at arbitrary accuracy. The highest level of privacy guaranteed over all $(\epsilon, \delta)$-useful mechanisms with
respect to a target mechanism $M$ is quantified by the optimal differential privacy for $M$. We define this
notion for the SVM here, but the concept extends to any target mechanism of interest. We present upper
and lower bounds on $\beta(\epsilon, \delta, C, n, \ell, k)$ for the SVM in Sections 5 and 6 respectively.

Definition 4. For $\epsilon, C > 0$, $\delta \in (0, 1)$, $n > 1$, loss function $\ell(y, \hat{y})$ convex in $\hat{y}$, and kernel $k$, the optimal
differential privacy for the SVM is the function

$$\beta(\epsilon, \delta, C, n, \ell, k) = \inf_{\hat{M} \in \mathcal{I}} \; \sup_{(D_1, D_2) \in \mathcal{D}} \; \sup_{t \in \mathcal{T}_{\hat{M}}} \log\left(\frac{\Pr(\hat{M}(D_1) = t)}{\Pr(\hat{M}(D_2) = t)}\right) ,$$

where $\mathcal{I}$ is the set of all $(\epsilon, \delta)$-useful mechanisms with respect to the SVM with parameter $C$, loss $\ell$, and
kernel $k$; and $\mathcal{D}$ is the set of all pairs of neighboring databases with $n$ entries.

2.1. Background on Support Vector Machines. Soft-margin SVM learning corresponds to the convex
Primal program

(2.1) $$\min_{w \in \mathbb{R}^F} \; \frac{1}{2}\|w\|_2^2 + \frac{C}{n}\sum_{i=1}^{n} \ell\left(y_i, f_w(x_i)\right) ,$$

where the $x_i \in \mathbb{R}^d$ are training input points and the $y_i \in \{-1, 1\}$ are their training labels, $n$ is the size of the
training set, $\phi : \mathbb{R}^d \to \mathbb{R}^F$ is a feature mapping taking points in input space $\mathbb{R}^d$ to some (possibly infinite)
$F$-dimensional feature space, $\ell(y, \hat{y})$ is a loss function convex in $\hat{y}$, and $w$ is a hyperplane normal vector in
feature space.

When $F$ is finite, predictions are made by taking the sign of $f^\star(x) = f_{w^\star}(x) = \langle \phi(x), w^\star \rangle$. We will refer
to both $f_w(\cdot)$ and $\mathrm{sgn}(f_w(\cdot))$ as classifiers, with the exact meaning apparent from the context. When $F$ is
large and when inner-products in feature space may be computed quickly via an explicit representation of the
kernel function $k(x, y) = \langle \phi(x), \phi(y) \rangle$, the solution may be more easily obtained via the dual. For example,
see Program (5.1) in Section 5 for the dual formulation of the hinge-loss $\ell(y, \hat{y}) = (1 - y\hat{y})_+$, which is the
loss most commonly associated with soft-margin SVM. Other examples include the square loss $(1 - y\hat{y})^2$
and logistic loss $\log(1 + \exp(-y\hat{y}))$. The vector of maximizing dual variables $\alpha^\star$ returned by dualized SVM
parametrizes the function $f^\star = f_{\alpha^\star}$ as $f_\alpha(\cdot) = \sum_{i=1}^{n} \alpha_i y_i k(\cdot, x_i)$.
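The dual expansion $f_\alpha(\cdot) = \sum_i \alpha_i y_i k(\cdot, x_i)$ can be sketched in a few lines. This is a hedged illustration rather than an SVM solver: the dual variables are taken as given, and `rbf_kernel` and `dual_classifier` are hypothetical names with a unit-bandwidth RBF kernel assumed for the example.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """RBF kernel exp(-||x - y||^2 / (2 sigma^2)); the bandwidth is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def dual_classifier(alpha, xs, ys, kernel):
    """Return f(x) = sum_i alpha_i * y_i * k(x, x_i) for given dual variables."""
    def f(x):
        return sum(a * y * kernel(x, xi) for a, xi, y in zip(alpha, xs, ys))
    return f


# Two training points with opposite labels and equal (made-up) dual weights.
f = dual_classifier([1.0, 1.0], [(0.0, 0.0), (2.0, 0.0)], [1, -1], rbf_kernel)
label_near_first_point = 1 if f((0.1, 0.0)) > 0 else -1
```

Note that evaluating `f` requires the training points `xs` themselves, not only the dual variables.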

More generally, the Support Vector Machine can be seen as performing regularized ERM in a Reproducing
Kernel Hilbert Space (RKHS) $\mathcal{H}$. The Representer Theorem (Kimeldorf and Wahba 1971) states that the
minimizing $f^\star = \arg\min_{f \in \mathcal{H}} \frac{1}{2}\|f\|_{\mathcal{H}}^2 + \frac{C}{n}\sum_{i=1}^{n} \ell(y_i, f(x_i))$ lies in the span of the functions $k(\cdot, x_i) \in \mathcal{H}$.
Indeed the above dual expansion shows that the coordinates in this subspace are given by the $\alpha_i^\star y_i$.

We define the mechanism SVM to be the dual optimization that responds with the vector $\alpha^\star$, as described
by Algorithm 1. For general information about SVMs see e.g., (Burges 1998; Cristianini and Shawe-Taylor
2000; Schölkopf and Smola 2001; Bishop 2006). We end this section with the definition of an important
class of kernels (see Table 1 for examples).

Definition 5. A kernel function of the form $k(x, y) = g(x - y)$, for some function $g$, is called
translation-invariant.
Kernel       $g(\Delta)$                                 $p(\omega)$
RBF          $\exp(-\|\Delta\|_2^2 / 2)$                 $(2\pi)^{-d/2} \exp(-\|\omega\|_2^2 / 2)$
Laplacian    $\exp(-\|\Delta\|_1)$                       $\prod_{i=1}^{d} \frac{1}{\pi (1 + \omega_i^2)}$
Cauchy       $\prod_{i=1}^{d} \frac{2}{1 + \Delta_i^2}$  $\exp(-\|\omega\|_1)$

Table 1. Example translation-invariant kernels, their $g$ functions and the corresponding
Fourier transforms.
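As a small illustration of Definition 5, the sketch below builds kernels from profile functions $g$ and checks numerically that shifting both arguments leaves the kernel value unchanged. The helper names are hypothetical, and a unit-bandwidth RBF profile is assumed.

```python
import math


def g_rbf(delta):
    """RBF profile g(delta) = exp(-||delta||_2^2 / 2); unit bandwidth assumed."""
    return math.exp(-sum(v * v for v in delta) / 2.0)


def g_laplacian(delta):
    """Laplacian profile g(delta) = exp(-||delta||_1)."""
    return math.exp(-sum(abs(v) for v in delta))


def make_kernel(g):
    """Build the translation-invariant kernel k(x, y) = g(x - y) from a profile g."""
    def k(x, y):
        return g([a - b for a, b in zip(x, y)])
    return k


def shift(x, s):
    """Translate the point x by the vector s."""
    return [a + b for a, b in zip(x, s)]
```

For either profile, `make_kernel(g)(x, y)` and `make_kernel(g)(shift(x, s), shift(y, s))` agree up to floating-point rounding, and both kernels take the value 1 on the diagonal.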



3. Mechanism for Finite Feature Maps 



As a first step towards private SVM learning we begin by considering the simple case of finite
$F$-dimensional feature maps. Algorithm 2 describes the PrivateSVM-Finite mechanism, which follows
the usual pattern of preserving differential privacy: after forming the primal solution to the SVM (an
$F$-dimensional vector), the mechanism adds Laplace-distributed noise to the weight vector. Guaranteeing
differential privacy proceeds via the usual two-step process of calculating the $L_1$-sensitivity of the SVM's
weight vector, then showing that $\beta$-differential privacy follows from sensitivity together with the choice of
Laplace noise with scale equal to sensitivity divided by $\beta$.
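The noise-addition step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the non-private weight vector is assumed to have been computed already by some SVM solver, `sample_laplace` and `privatize_weights` are hypothetical names, and Laplace noise is generated as the difference of two independent exponential draws using only the standard library.

```python
import random


def sample_laplace(scale, rng):
    """Draw one Laplace(0, scale) sample as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_weights(w, scale, rng):
    """Add i.i.d. Laplace(0, scale) noise to each coordinate of the weight vector."""
    return [wi + sample_laplace(scale, rng) for wi in w]


rng = random.Random(0)          # fixed seed only to make the example repeatable
w_tilde = [0.5, -1.0, 2.0]      # stand-in for a non-private SVM weight vector
w_hat = privatize_weights(w_tilde, scale=0.1, rng=rng)
```

A real deployment would need a cryptographically sound source of randomness and care with floating-point effects; neither concern is addressed in this sketch.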

To calculate sensitivity, we exploit the algorithmic stability of regularized ERM. Intuitively, stability
corresponds to continuity of a learning map. Several notions of stability are known to lead to good generalization
error bounds (Devroye and Wagner 1979; Kearns and Ron 1999; Bousquet and Elisseeff 2002; Kutin and
Niyogi 2002), sometimes in cases where class capacity-based approaches such as VC theory do not apply. A
learning map $\mathcal{A}$ is a function that maps a database $D$ to a classifier $f_D$; it is precisely the composition of a
mechanism followed by the classifier parametrization mapping (see footnote 3). A learning map $\mathcal{A}$ is said to have
$\gamma$-uniform stability with respect to loss $\ell(\cdot, \cdot)$ if for all neighboring databases $D, D'$, the losses of the classifiers trained
on $D$ and $D'$ are close on all test examples: $\|\ell(\cdot, \mathcal{A}(D)) - \ell(\cdot, \mathcal{A}(D'))\|_\infty \le \gamma$ (Bousquet and Elisseeff 2002).
Our first lemma computes sensitivity by following the proof of (Schölkopf and Smola 2001, Theorem 12.4)
which establishes that SVM learning has uniform stability (a result due to Bousquet and Elisseeff 2002). For
simplicity we restrict the proof of sensitivity to differentiable loss functions in Lemma 6; the result remains
the same for general convex loss functions. See Lemma 21 for an almost identical proof for subdifferentiable
losses.

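Footnote 3. For example an SVM mechanism may return a weight vector $w^\star$ or dual coefficients $\alpha^\star$ which in turn parametrizes the
classifier $f^\star$. The SVM learning map takes the training database directly to the classifier.

Algorithm 2 PrivateSVM-Finite

Inputs: database $D = \{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$; finite feature map $\phi : \mathbb{R}^d \to \mathbb{R}^F$ and
induced kernel $k$; convex loss function $\ell$; and parameters $\lambda, C > 0$.

(1) $\alpha^\star \leftarrow$ Run Algorithm 1 on $D$ with parameter $C$, kernel $k$, and loss $\ell$;

(2) $\tilde{w} \leftarrow \sum_{i=1}^{n} \alpha_i^\star y_i \phi(x_i)$;

(3) $\mu \leftarrow$ Draw i.i.d. sample of $F$ scalars from Laplace$(0, \lambda)$; and

(4) Return $\hat{w} = \tilde{w} + \mu$.

Lemma 6. Consider loss function $\ell(y, \hat{y})$ that is differentiable, convex and $L$-Lipschitz in $\hat{y}$, and an RKHS $\mathcal{H}$
induced by finite $F$-dimensional feature mapping $\phi$ with bounded norm $k(x, x) \le \kappa^2$ for all $x \in \mathbb{R}^d$. Let $w_S \in \mathbb{R}^F$
be the minimizer of the following regularized empirical risk function for each database $S = \{(x_i, y_i)\}_{i=1}^n$:

$$R_{\mathrm{reg}}(w, S) = \frac{C}{n}\sum_{i=1}^{n} \ell\left(y_i, f_w(x_i)\right) + \frac{1}{2}\|w\|_2^2 .$$

Then for every pair of neighboring databases $D, D'$ of $n$ entries, $\|w_D - w_{D'}\|_1 \le 4LC\kappa\sqrt{F}/n$.

Proof. For convenience we define $R_{\mathrm{emp}}(w, S) = n^{-1}\sum_{i=1}^{n} \ell(y_i, f_w(x_i))$ for any database $S$; then the
first-order necessary KKT conditions imply

(3.1) $$\partial_w R_{\mathrm{reg}}(w_D, D) = C\,\partial_w R_{\mathrm{emp}}(w_D, D) + w_D = 0 ,$$

(3.2) $$\partial_w R_{\mathrm{reg}}(w_{D'}, D') = C\,\partial_w R_{\mathrm{emp}}(w_{D'}, D') + w_{D'} = 0 ,$$

where $\partial_w$ is the partial derivative operator with respect to $w$. Define the auxiliary risk function

$$\tilde{R}(w) = C\left\langle \partial_w R_{\mathrm{emp}}(w_D, D) - \partial_w R_{\mathrm{emp}}(w_{D'}, D'),\; w - w_{D'} \right\rangle + \frac{1}{2}\|w - w_{D'}\|_2^2 .$$

It is easy to see that $\tilde{R}(w)$ is strictly convex in $w$ and that $\tilde{R}(w_{D'}) = 0$. And since by Equation (3.2)

$$\partial_w \tilde{R}(w) = C\,\partial_w R_{\mathrm{emp}}(w_D, D) - C\,\partial_w R_{\mathrm{emp}}(w_{D'}, D') + w - w_{D'}
 = C\,\partial_w R_{\mathrm{emp}}(w_D, D) + w ,$$

it follows that $\tilde{R}(w)$ is minimized at $w_D$ by Equation (3.1). Thus $\tilde{R}(w_D) \le 0$. Next simplify the first term
of $\tilde{R}(w_D)$, scaled by $n/C$ for simplicity:

$$\begin{aligned}
& n\left\langle \partial_w R_{\mathrm{emp}}(w_D, D) - \partial_w R_{\mathrm{emp}}(w_{D'}, D'),\; w_D - w_{D'} \right\rangle \\
&= \sum_{i=1}^{n} \left\langle \partial_w \ell\left(y_i, f_{w_D}(x_i)\right) - \partial_w \ell\left(y_i', f_{w_{D'}}(x_i')\right),\; w_D - w_{D'} \right\rangle \\
&= \sum_{i=1}^{n-1} \left( \ell'\left(y_i, f_{w_D}(x_i)\right) - \ell'\left(y_i, f_{w_{D'}}(x_i)\right) \right) \left( f_{w_D}(x_i) - f_{w_{D'}}(x_i) \right) \\
&\quad + \ell'\left(y_n, f_{w_D}(x_n)\right) \left( f_{w_D}(x_n) - f_{w_{D'}}(x_n) \right) - \ell'\left(y_n', f_{w_{D'}}(x_n')\right) \left( f_{w_D}(x_n') - f_{w_{D'}}(x_n') \right) \\
&\ge \ell'\left(y_n, f_{w_D}(x_n)\right) \left( f_{w_D}(x_n) - f_{w_{D'}}(x_n) \right) - \ell'\left(y_n', f_{w_{D'}}(x_n')\right) \left( f_{w_D}(x_n') - f_{w_{D'}}(x_n') \right) ,
\end{aligned}$$

where $\ell'(y, \hat{y}) = \partial_{\hat{y}} \ell(y, \hat{y})$. The second equality follows from $\partial_w \ell(y, f_w(x)) = \ell'(y, f_w(x))\,\phi(x)$ and $x_i' = x_i$
and $y_i' = y_i$ for each $i \in [n - 1]$, and the inequality follows from the differentiability and convexity (see footnote 4) of $\ell$ in $\hat{y}$.
Combined with $\tilde{R}(w_D) \le 0$ this yields

$$\frac{n}{2C}\|w_D - w_{D'}\|_2^2 \le \ell'\left(y_n', f_{w_{D'}}(x_n')\right) \left( f_{w_D}(x_n') - f_{w_{D'}}(x_n') \right) - \ell'\left(y_n, f_{w_D}(x_n)\right) \left( f_{w_D}(x_n) - f_{w_{D'}}(x_n) \right)$$

(3.3) $$\le 2L \left\| f_{w_D} - f_{w_{D'}} \right\|_\infty ,$$

by the Lipschitz continuity of $\ell$. Now by the reproducing property and Cauchy-Schwarz inequality we can
upper bound the classifier difference's infinity norm by the Euclidean norm on the weight vectors: for each
$x$

$$\begin{aligned}
\left| f_{w_D}(x) - f_{w_{D'}}(x) \right| &= \left| \langle \phi(x), w_D - w_{D'} \rangle \right| \\
&\le \|\phi(x)\|_2 \, \|w_D - w_{D'}\|_2 \\
&= \sqrt{k(x, x)} \, \|w_D - w_{D'}\|_2 \\
&\le \kappa \, \|w_D - w_{D'}\|_2 .
\end{aligned}$$

Combining this with Inequality (3.3) yields $\|w_D - w_{D'}\|_2 \le 4LC\kappa/n$. $L_1$-based sensitivity then follows from
the inequality $\|w\|_1 \le \sqrt{F}\|w\|_2$ for all $w \in \mathbb{R}^F$. □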

With the weight vector's sensitivity in hand, differential privacy follows immediately from the proof
technique established by Dwork et al. (2006).

Theorem 7 (Privacy of PrivateSVM-Finite). For any $\beta > 0$, database $D$ of size $n$, $C > 0$, loss function
$\ell(y, \hat{y})$ that is convex and $L$-Lipschitz in $\hat{y}$, and finite $F$-dimensional feature map with kernel $k(x, x) \le \kappa^2$
for all $x \in \mathbb{R}^d$, PrivateSVM-Finite run on $D$ with loss $\ell$, kernel $k$, noise parameter $\lambda \ge 4LC\kappa\sqrt{F}/(\beta n)$
and regularization parameter $C$ guarantees $\beta$-differential privacy.

This first main result establishes the usual kind of differential privacy guarantee for the new
PrivateSVM-Finite algorithm. The more "private" the data, the more noise must be added. The more entries in the
database, the less noise is needed to achieve the same level of privacy. Since the noise vector $\mu$ has exponential
tails, standard tail bound inequalities quickly lead to $(\epsilon, \delta)$-usefulness for PrivateSVM-Finite.
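The noise level required by Theorem 7 is a closed-form expression, so it can be computed directly. The sketch below simply evaluates $4LC\kappa\sqrt{F}/(\beta n)$; the function name and the example parameter values are illustrative assumptions.

```python
import math


def min_noise_scale(L, C, kappa, F, beta, n):
    """Smallest Laplace scale allowed by Theorem 7: 4 L C kappa sqrt(F) / (beta n)."""
    return 4.0 * L * C * kappa * math.sqrt(F) / (beta * n)


# Made-up setting: 1-Lipschitz loss, C = 1, kappa = 1, F = 4 features, beta = 1.
scale_small_db = min_noise_scale(L=1.0, C=1.0, kappa=1.0, F=4, beta=1.0, n=100)
scale_large_db = min_noise_scale(L=1.0, C=1.0, kappa=1.0, F=4, beta=1.0, n=200)
```

Doubling the database size halves the required scale, and halving $\beta$ (demanding more privacy) doubles it, matching the qualitative discussion above.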



Footnote 4. Namely for differentiable convex $f$ and any $a, b \in \mathbb{R}$, $(f'(a) - f'(b))(a - b) \ge 0$.
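Theorem 8 (Utility of PrivateSVM-Finite). Consider any $C > 0$, $n > 1$, database $D$ of $n$ entries,
arbitrary convex loss $\ell$, and finite $F$-dimensional feature mapping $\phi$ with kernel $k$ and $|\phi(x)_i| \le \Phi$ for all
$x \in \mathcal{M}$ and $i \in [F]$ for some $\Phi > 0$ and $\mathcal{M} \subseteq \mathbb{R}^d$. For any $\epsilon > 0$ and $\delta \in (0, 1)$, PrivateSVM-Finite run
on $D$ with loss $\ell$, kernel $k$, noise parameter $0 < \lambda \le \frac{\epsilon}{2\Phi\left(F \log_e 2 + \log_e \frac{1}{\delta}\right)}$ and regularization parameter $C$, is
$(\epsilon, \delta)$-useful with respect to the SVM under the $\|\cdot\|_{\infty;\mathcal{M}}$-norm.

Proof. Our goal is to compare the SVM and PrivateSVM-Finite classifications of any point $x \in \mathcal{M}$:

$$\begin{aligned}
\left| f_{\hat{M}(D)}(x) - f_{M(D)}(x) \right| &= \left| \langle \hat{w}, \phi(x) \rangle - \langle \tilde{w}, \phi(x) \rangle \right| \\
&= \left| \langle \mu, \phi(x) \rangle \right| \\
&\le \|\mu\|_1 \|\phi(x)\|_\infty \\
&\le \Phi \|\mu\|_1 .
\end{aligned}$$

The absolute value of a zero mean Laplace random variable with scale parameter $\lambda$ is exponentially
distributed with rate $\lambda^{-1}$ (that is, scale $\lambda$). Moreover the sum of $q$ i.i.d. exponential random variables has
Erlang $q$-distribution with the same scale parameter (see footnote 5). Thus we have, for Erlang $F$-distributed
random variable $X$ and any $t > 0$,

$$\forall x \in \mathcal{M}, \quad \left| f_{\hat{M}(D)}(x) - f_{M(D)}(x) \right| \le \Phi X$$

$$\Rightarrow \quad \forall \epsilon > 0, \quad \Pr\left( \left\| f_{\hat{M}(D)} - f_{M(D)} \right\|_{\infty;\mathcal{M}} > \epsilon \right) \le \Pr\left( X > \epsilon/\Phi \right) = \Pr\left( e^{tX} > e^{\epsilon t/\Phi} \right)$$

(3.4) $$\le \frac{\mathbb{E}\left[e^{tX}\right]}{e^{\epsilon t/\Phi}} .$$

Here we have employed the standard Chernoff tail bound technique using Markov's inequality. The numerator
of (3.4), the moment generating function of the Erlang $F$-distribution with parameter $\lambda$, is $(1 - \lambda t)^{-F}$ for
all $t < \lambda^{-1}$. Together with the choice of $t = (2\lambda)^{-1}$, this gives

$$\begin{aligned}
\Pr\left( \left\| f_{\hat{M}(D)} - f_{M(D)} \right\|_{\infty;\mathcal{M}} > \epsilon \right) &\le (1 - \lambda t)^{-F} e^{-\epsilon t/\Phi} \\
&= 2^F e^{-\epsilon/(2\lambda\Phi)} \\
&= \exp\left( F \log_e 2 - \epsilon/(2\lambda\Phi) \right) .
\end{aligned}$$

And provided that $\lambda \le \epsilon / \left( 2\Phi \left( F \log_e 2 + \log_e \frac{1}{\delta} \right) \right)$ this probability is bounded by $\delta$. □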



Our second main result establishes that PrivateSVM-Finite is not only differentially private, but that 
it releases a classifier that is similar to the SVM. Utility and privacy are competing properties, however, 
since utility demands that the noise not be too large. 
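The utility condition of Theorem 8 can be checked empirically with a short simulation. The sketch below computes the largest admissible scale $\epsilon / (2\Phi(F \log_e 2 + \log_e \frac{1}{\delta}))$ and estimates how often the bound $\Phi\|\mu\|_1$ on the classifier deviation exceeds $\epsilon$. The function names, parameter values and Monte Carlo setup are assumptions made for illustration.

```python
import math
import random


def max_noise_scale(eps, delta, Phi, F):
    """Largest Laplace scale allowed by Theorem 8 for (eps, delta)-usefulness."""
    return eps / (2.0 * Phi * (F * math.log(2.0) + math.log(1.0 / delta)))


def failure_rate(eps, Phi, F, scale, trials, rng):
    """Fraction of trials in which Phi * ||mu||_1 exceeds eps.

    Each |mu_i| is exponential with mean `scale`, so ||mu||_1 is a sum of
    F independent exponential draws (an Erlang F-distributed variable).
    """
    bad = 0
    for _ in range(trials):
        l1_norm = sum(rng.expovariate(1.0 / scale) for _ in range(F))
        if Phi * l1_norm > eps:
            bad += 1
    return bad / trials


rng = random.Random(1)  # fixed seed only to make the example repeatable
scale = max_noise_scale(eps=0.1, delta=0.05, Phi=1.0, F=5)
rate = failure_rate(eps=0.1, Phi=1.0, F=5, scale=scale, trials=20000, rng=rng)
```

Because the Chernoff bound in the proof is not tight, the observed failure rate is typically well below $\delta$.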



Footnote 5. The Erlang $q$-distribution has density $\frac{x^{q-1} \exp(-x/\lambda)}{\lambda^q (q-1)!}$, CDF $1 - e^{-x/\lambda} \sum_{j=0}^{q-1} \frac{(x/\lambda)^j}{j!}$,
expectation $q\lambda$ and variance $q\lambda^2$.

4. Mechanism for Translation-Invariant Kernels

Consider now the problem of privately learning in an RKHS $\mathcal{H}$ induced by an infinite dimensional feature
mapping $\phi$. As a mechanism's response must be finitely encodable, the primal parametrization seems less
appealing than it did in PrivateSVM-Finite. It is natural to look to the SVM's dual solution as a starting
point: the Representer Theorem (Kimeldorf and Wahba 1971) states that the optimizing $f^\star \in \mathcal{H}$ must
be in the span of the data, a finite-dimensional subspace. While the coordinates in this subspace (the $\alpha_i^\star$
dual variables) could be perturbed in the usual way to guarantee differential privacy, the subspace's basis
(the data) is also needed to parametrize $f^\star$. To side-step this apparent stumbling block, we take another
approach by approximating $\mathcal{H}$ with a random RKHS $\hat{\mathcal{H}}$ induced by a random finite-dimensional map $\hat{\phi}$. This
then allows us to respond with a finite primal parametrization. Algorithm 3 summarizes the PrivateSVM
mechanism.

Algorithm 3 PrivateSVM

Inputs: database $D = \{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$; translation-invariant kernel
$k(x, y) = g(x - y)$ with Fourier transform $p(\omega) = 2^{-1} \int e^{-j\langle \omega, x \rangle} g(x)\, dx$; convex loss function $\ell$; parameters
$\lambda, C > 0$ and $\hat{d} \in \mathbb{N}$.

(1) $\rho_1, \ldots, \rho_{\hat{d}} \leftarrow$ Draw i.i.d. sample of $\hat{d}$ vectors in $\mathbb{R}^d$ from $p$;

(2) $\tilde{\alpha} \leftarrow$ Run Algorithm 1 on $D$ with parameter $C$, kernel $\hat{k}$ induced by map (4.1), and loss $\ell$;

(3) $\tilde{w} \leftarrow \sum_{i=1}^{n} y_i \tilde{\alpha}_i \hat{\phi}(x_i)$ where $\hat{\phi}$ is defined in Equation (4.1);

(4) $\mu \leftarrow$ Draw i.i.d. sample of $2\hat{d}$ scalars from Laplace$(0, \lambda)$; and

(5) Return $\hat{w} = \tilde{w} + \mu$ and $\rho_1, \ldots, \rho_{\hat{d}}$.

As noted recently by Rahimi and Recht (2008), the Fourier transform $p$ of the $g$ function of a continuous
positive-definite translation-invariant kernel is a non-negative measure (Rudin 1994). Rahimi and Recht
(2008) exploit this fact to construct a random finite-dimensional RKHS $\hat{\mathcal{H}}$ by drawing $\hat{d}$ vectors from $p$.
These vectors $\rho_1, \ldots, \rho_{\hat{d}}$ define the following random $2\hat{d}$-dimensional feature map

(4.1) $$\hat{\phi}(\cdot) = \hat{d}^{-1/2} \left[ \cos(\langle \rho_1, \cdot \rangle), \sin(\langle \rho_1, \cdot \rangle), \ldots, \cos(\langle \rho_{\hat{d}}, \cdot \rangle), \sin(\langle \rho_{\hat{d}}, \cdot \rangle) \right]^T .$$

Inner-products in the random feature space approximate $k(\cdot, \cdot)$ uniformly, and to arbitrary precision
depending on parameter $\hat{d}$, as restated in Lemma 13. We denote the inner-product in the random feature space
by $\hat{k}$. Rahimi and Recht (2008) applied this approximation to large-scale learning (situations where $n$ is
large). Instead of employing non-linear SVM's dual solution which takes $O(n^2)$ time, the primal solution to
linear SVM on $\hat{\phi}$ is used, as it takes time quadratic in $\hat{d}$ to compute. For large-scale learning, good
approximations can be found for $\hat{d} \ll n$. Table 1 presents three important translation-invariant kernels and their
transformations. PrivateSVM employs the same trick for translation-invariant kernels, but in a different
setting. Here regularized ERM is performed in $\hat{\mathcal{H}}$, not to avoid complexity in $n$, but to provide a direct finite
representation $\tilde{w}$ of the primal solution in the case of infinite dimensional feature spaces. After performing
regularized ERM in $\hat{\mathcal{H}}$, appropriate Laplace noise is added to the primal solution $\tilde{w}$ to guarantee differential
privacy as before.
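The random feature construction can be sketched for one concrete case. The code below assumes the unit-bandwidth RBF kernel $g(\Delta) = \exp(-\|\Delta\|_2^2/2)$, whose Fourier transform is the standard Gaussian density, so the frequencies are drawn as standard normal vectors. All function names are hypothetical, and the sketch covers only the feature map, not the SVM training or the noise step.

```python
import math
import random


def draw_frequencies(d, d_hat, rng):
    """Draw d_hat i.i.d. frequency vectors in R^d from N(0, I)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d_hat)]


def random_feature_map(x, freqs):
    """Map x to the 2*d_hat-dimensional cos/sin feature vector, scaled by d_hat^(-1/2)."""
    scale = 1.0 / math.sqrt(len(freqs))
    feats = []
    for rho in freqs:
        proj = sum(r * xi for r, xi in zip(rho, x))
        feats.append(scale * math.cos(proj))
        feats.append(scale * math.sin(proj))
    return feats


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(0)  # fixed seed only to make the example repeatable
freqs = draw_frequencies(d=2, d_hat=2000, rng=rng)
x, y = (0.3, -0.2), (1.0, 0.5)
approx = dot(random_feature_map(x, freqs), random_feature_map(y, freqs))
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0)
```

Since $\cos^2 + \sin^2 = 1$ for every frequency, the approximate kernel satisfies $\hat{k}(x, x) = 1$ regardless of the draw, which is the property used in the privacy proof for PrivateSVM.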

PrivateSVM is computationally efficient. Algorithm 3 takes $O(\hat{d})$ time to compute each entry of the
kernel matrix, or a total time of $O(\hat{d} n^2)$ on top of running dual SVM in the random feature space which is
worst-case $O(n_s^3)$ for the analytic solution (where $n_s \le n$ is the number of support vectors), and faster using
numerical methods such as chunking (Burges 1998). To achieve $(\epsilon, \delta)$-usefulness with respect to the hinge-loss SVM, $\hat{d}$
must be taken to be $O\left( \frac{d}{\epsilon^4} \left( \log \frac{1}{\delta} + \log \frac{1}{\epsilon} \right) \right)$ (cf. Corollary 15). By comparison it takes $O(d n^2)$ to construct
the kernel matrix for any translation-invariant kernel.

As with the SVM and PrivateSVM-Finite, the response of Algorithm 3 can be used to make
classifications on future test points by constructing the classifier $\hat{f}^\star(\cdot) = f_{\hat{w}}(\cdot) = \langle \hat{w}, \hat{\phi}(\cdot) \rangle$. Unlike the
previous mechanisms, however, PrivateSVM must include a parametrization of the feature map $\hat{\phi}$ (the sample
$\{\rho_i\}_{i=1}^{\hat{d}}$) in its response. Of PrivateSVM's total response, only $\hat{w}$ depends on database $D$. The $\rho_i$ are
data-independent vectors drawn from the transform $p$ of the kernel, which we assume to be known by the
adversary (to wit the adversary knows the mechanism itself, including $k$). Thus to establish differential
privacy we need only consider the data-dependent weight vector; fortunately we have already considered the
similar case of PrivateSVM-Finite.

Corollary 9 (Privacy of PrivateSVM). For any $\beta > 0$, database $D$ of size $n$, $C > 0$, $\hat{d} \in \mathbb{N}$, loss function
$\ell(y, \hat{y})$ that is convex and $L$-Lipschitz in $\hat{y}$, and translation-invariant kernel $k$, PrivateSVM run on $D$
with loss $\ell$, kernel $k$, noise parameter $\lambda \ge 2^{2.5} L C \sqrt{\hat{d}} / (\beta n)$, approximation parameter $\hat{d}$, and regularization
parameter $C$ guarantees $\beta$-differential privacy.

Proof. The result follows immediately from Theorem 7 since $\tilde{w}$ is the primal solution of SVM with kernel $\hat{k}$,
the response vector $\hat{w} = \tilde{w} + \mu$ for i.i.d. Laplace $\mu$, and $\hat{k}(x, x) = 1$ for all $x \in \mathbb{R}^d$. □
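
The noise step in Corollary 9 can be sketched as follows. This is only an illustration of the stated condition $\lambda \ge 2^{2.5} L C \sqrt{\hat{d}}/(\beta n)$ and of adding i.i.d. Laplace noise to a weight vector; the helper names are hypothetical, and the weight vector below is a placeholder rather than the output of an actual SVM solver.

```python
import math
import random

def min_noise_scale(lipschitz, C, d_hat, beta, n):
    """Smallest Laplace scale allowed by Corollary 9 for beta-differential privacy."""
    return 2 ** 2.5 * lipschitz * C * math.sqrt(d_hat) / (beta * n)

def laplace_sample(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. unit-rate
    exponentials is a standard Laplace variable."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def privatize(w_tilde, scale, rng):
    """Return w_hat = w_tilde + mu with i.i.d. Laplace(0, scale) entries mu."""
    return [w + laplace_sample(scale, rng) for w in w_tilde]

rng = random.Random(1)
lam = min_noise_scale(lipschitz=1.0, C=1.0, d_hat=4, beta=1.0, n=100)
w_tilde = [0.0] * 8  # placeholder weights of length 2 * d_hat, not a trained SVM
print(lam, privatize(w_tilde, lam, rng))
```

Note that the required scale shrinks like $1/n$ and grows like $\sqrt{\hat{d}}$, which is the tension with utility examined next.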

This result is surprising, in that PrivateSVM is able to guarantee privacy for regularized ERM over a
function class of infinite VC-dimension, where the obvious way to return the learned classifier (responding
with the dual variables and feature mapping) completely reveals all the entries corresponding to the support
vectors.

Like PrivateSVM-Finite, PrivateSVM is useful with respect to the SVM. If we denote the function
parametrized by intermediate weight vector $\tilde{w}$ by $\tilde{f}$, then the same argument for the utility of
PrivateSVM-Finite establishes the high-probability proximity of $\tilde{f}$ and $\hat{f}^*$.

Lemma 10. Consider a run of Algorithms^^and^with d £ N, C > 0, convex loss and translation-invariant 
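kernel. Denote by $\hat{f}^*$ and $\tilde{f}$ the classifiers parametrized by weight vectors $\hat{w}$ and $\tilde{w}$ respectively, where these
vectors are related by $\hat{w} = \tilde{w} + \mu$ with $\mu \overset{iid}{\sim} \mathrm{Laplace}(0, \lambda)$ in Algorithm 3. For any $\epsilon > 0$ and $\delta \in (0, 1)$, if

$$0 < \lambda \le \min\left\{ \frac{\epsilon}{2^4 \log_e 2 \,\sqrt{\hat{d}}},\; \frac{\epsilon \sqrt{\hat{d}}}{8 \log_e \frac{2}{\delta}} \right\}$$

then

$$\Pr\left( \left\| \hat{f}^* - \tilde{f} \right\|_\infty \le \frac{\epsilon}{2} \right) \ge 1 - \frac{\delta}{2} .$$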



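Proof. As in the proof of Theorem 8 we can use the Chernoff trick to show that, for Erlang $2\hat{d}$-distributed
random variable $X$, the choice of $t = (2\lambda)^{-1}$, and for any $\epsilon > 0$

$$\begin{aligned}
\Pr\left( \left\| \hat{f}^* - \tilde{f} \right\|_\infty > \epsilon/2 \right)
&\le \frac{E\left[ e^{tX} \right]}{e^{\epsilon t \sqrt{\hat{d}}/2}} \\
&= (1 - \lambda t)^{-2\hat{d}} \, e^{-\epsilon t \sqrt{\hat{d}}/2} \\
&= 2^{2\hat{d}} \, e^{-\epsilon \sqrt{\hat{d}}/(4\lambda)} \\
&= \exp\left( \hat{d} \log_e 4 - \epsilon \sqrt{\hat{d}}/(4\lambda) \right) .
\end{aligned}$$

Provided that $\lambda \le \epsilon / \left( 2^4 \log_e 2 \,\sqrt{\hat{d}} \right)$ this is bounded by $\exp\left( -\epsilon \sqrt{\hat{d}}/(8\lambda) \right)$. Moreover if $\lambda \le \epsilon \sqrt{\hat{d}} / \left( 8 \log_e \frac{2}{\delta} \right)$,
then the claim follows. □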
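
As a numerical sanity check of the two conditions on the noise scale, the following sketch evaluates the Chernoff tail bound $\exp(\hat{d}\log_e 4 - \epsilon\sqrt{\hat{d}}/(4\lambda))$ at the largest admissible $\lambda$ and confirms it does not exceed $\delta/2$. The function names are hypothetical and the parameter values are arbitrary.

```python
import math

def max_noise_scale(eps, delta, d_hat):
    """Largest lambda satisfying both conditions of Lemma 10."""
    a = eps / (2 ** 4 * math.log(2) * math.sqrt(d_hat))
    b = eps * math.sqrt(d_hat) / (8 * math.log(2 / delta))
    return min(a, b)

def chernoff_tail_bound(eps, lam, d_hat):
    """Upper bound on Pr(||f_hat - f_tilde||_inf > eps / 2) from the proof."""
    return math.exp(d_hat * math.log(4) - eps * math.sqrt(d_hat) / (4 * lam))

for eps, delta, d_hat in [(0.1, 0.1, 100), (0.5, 0.01, 1), (1.0, 0.2, 10)]:
    lam = max_noise_scale(eps, delta, d_hat)
    print(eps, delta, d_hat, lam, chernoff_tail_bound(eps, lam, d_hat))
```

Any smaller $\lambda$ only tightens the bound, since the exponent is decreasing in $1/\lambda$.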

To show a similar result for $f^*$ and $\tilde{f}$, we exploit smoothness of the regularized ERM with respect to
small changes in the RKHS itself. To the best of our knowledge, this kind of stability to the feature mapping
has not been used before. We begin with a technical lemma that we will use to exploit the convexity of the
regularized empirical risk functional.

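Lemma 11. Let $R$ be a functional on Hilbert space $\mathcal{H}$ satisfying $R[f] \ge R[f^*] + \frac{a}{2} \|f - f^*\|_{\mathcal{H}}^2$ for some $a > 0$,
$f^* \in \mathcal{H}$ and all $f \in \mathcal{H}$. Then $R[f] \le R[f^*] + \epsilon$ implies $\|f - f^*\|_{\mathcal{H}} \le \sqrt{2\epsilon/a}$, for all $\epsilon > 0$, $f \in \mathcal{H}$.

Proof. By assumption and the antecedent

$$\begin{aligned}
\|f - f^*\|_{\mathcal{H}}^2 &\le \frac{2}{a} \left( R[f] - R[f^*] \right) \\
&\le \frac{2}{a} \left( R[f^*] + \epsilon - R[f^*] \right) \\
&= 2\epsilon/a .
\end{aligned}$$

Taking square roots of both sides yields the consequent. □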

Provided that the kernel functions $k$ and $\hat{k}$ are uniformly close, the next lemma exploits insensitivity
of regularized ERM to perturbations of the feature mapping to show that $f^*$ and $\tilde{f}$ are pointwise close.
Lemma 22 re-proves this result for non-differentiable loss functions.

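Lemma 12. Let $\mathcal{H}$ be an RKHS with translation-invariant kernel $k$, and let $\hat{\mathcal{H}}$ be the random RKHS
corresponding to feature map (4.1) induced by $k$. Let $C$ be a positive scalar and loss $\ell(y, \hat{y})$ be differentiable,
convex, and $L$-Lipschitz in $\hat{y}$. Consider the regularized empirical risk minimizers in each RKHS

$$f^* \in \arg\min_{f \in \mathcal{H}} \frac{C}{n} \sum_{i=1}^n \ell(y_i, f(x_i)) + \frac{1}{2} \|f\|_{\mathcal{H}}^2 ,$$

$$g^* \in \arg\min_{g \in \hat{\mathcal{H}}} \frac{C}{n} \sum_{i=1}^n \ell(y_i, g(x_i)) + \frac{1}{2} \|g\|_{\hat{\mathcal{H}}}^2 .$$

Let $\mathcal{M} \subseteq \mathbb{R}^d$ be any set containing $x_1, \ldots, x_n$. For any $\epsilon > 0$, if the dual variables from both optimizations
have $L_1$-norms bounded by some $\Lambda > 0$ and

$$\left\| k - \hat{k} \right\|_{\infty; \mathcal{M}} \le \min\left\{ 1, \frac{\epsilon^2}{2^2 \left( \Lambda + 2\sqrt{(CL + \Lambda/2)\Lambda} \right)^2} \right\}$$

then $\|f^* - g^*\|_{\infty; \mathcal{M}} \le \epsilon/2$.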

Proof. Denote the empirical risk functional by $R_{\mathrm{emp}}[f] = n^{-1} \sum_{i=1}^n \ell(y_i, f(x_i))$ and the regularized empirical
risk functional $R_{\mathrm{reg}}[f] = C R_{\mathrm{emp}}[f] + \|f\|^2/2$, for the appropriate RKHS norm (either $\mathcal{H}$ or $\hat{\mathcal{H}}$). Let $f^*$
denote the regularized empirical risk minimizer in $\mathcal{H}$, given by parameter vector $\alpha^*$, and let $g^*$ denote the
regularized empirical risk minimizer in $\hat{\mathcal{H}}$, given by parameter vector $\beta^*$. Let $g_{\alpha^*} = \sum_{i=1}^n \alpha_i^* y_i \hat{\phi}(x_i)$ and
$f_{\beta^*} = \sum_{i=1}^n \beta_i^* y_i \phi(x_i)$ denote the images of $f^*$ and $g^*$ under the natural mapping between the spans of the
data in RKHS's $\hat{\mathcal{H}}$ and $\mathcal{H}$ respectively. We will first show that these four functions have arbitrarily close
regularized empirical risk in their respective RKHS, and then that this implies uniform proximity of the
functions themselves. First observe that for any $g \in \hat{\mathcal{H}}$



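$$\begin{aligned}
R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g]
&= C R_{\mathrm{emp}}[g] + \frac{1}{2} \|g\|_{\hat{\mathcal{H}}}^2 \\
&\ge C \left\langle \partial_g R_{\mathrm{emp}}[g^*], g - g^* \right\rangle_{\hat{\mathcal{H}}} + C R_{\mathrm{emp}}[g^*] + \frac{1}{2} \|g\|_{\hat{\mathcal{H}}}^2 \\
&= \left\langle \partial_g R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g^*], g - g^* \right\rangle_{\hat{\mathcal{H}}} - \left\langle g^*, g - g^* \right\rangle_{\hat{\mathcal{H}}} + C R_{\mathrm{emp}}[g^*] + \frac{1}{2} \|g\|_{\hat{\mathcal{H}}}^2 \\
&= C R_{\mathrm{emp}}[g^*] + \frac{1}{2} \|g\|_{\hat{\mathcal{H}}}^2 - \left\langle g^*, g - g^* \right\rangle_{\hat{\mathcal{H}}} \\
&= C R_{\mathrm{emp}}[g^*] + \frac{1}{2} \|g^*\|_{\hat{\mathcal{H}}}^2 + \frac{1}{2} \|g\|_{\hat{\mathcal{H}}}^2 - \frac{1}{2} \|g^*\|_{\hat{\mathcal{H}}}^2 - \left\langle g^*, g - g^* \right\rangle_{\hat{\mathcal{H}}} \\
&= R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g^*] + \frac{1}{2} \|g\|_{\hat{\mathcal{H}}}^2 - \left\langle g^*, g \right\rangle_{\hat{\mathcal{H}}} + \frac{1}{2} \|g^*\|_{\hat{\mathcal{H}}}^2 \\
&= R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g^*] + \frac{1}{2} \|g - g^*\|_{\hat{\mathcal{H}}}^2 .
\end{aligned}$$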



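The inequality follows from the convexity of $R_{\mathrm{emp}}[\cdot]$; the subsequent equality by $\partial_g R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g] = C\, \partial_g R_{\mathrm{emp}}[g] + g$;
the third equality by $\partial_g R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g^*] = 0$; and the remainder by gathering terms. With this, Lemma 11
states that for any $g \in \hat{\mathcal{H}}$ and $\epsilon' > 0$,

$$R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g] \le R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g^*] + \epsilon' \;\Longrightarrow\; \|g - g^*\|_{\hat{\mathcal{H}}} \le \sqrt{2\epsilon'} . \tag{4.2}$$

Conditioned on the event $A_{\epsilon'} = \left\{ \left\| \hat{k} - k \right\|_{\infty; \mathcal{M}} \le \epsilon' \right\}$, for all $x \in \mathcal{M}$

$$\begin{aligned}
\left| f^*(x) - g_{\alpha^*}(x) \right|
&= \left| \sum_{i=1}^n \alpha_i^* y_i \left( k(x_i, x) - \hat{k}(x_i, x) \right) \right| \\
&\le \sum_{i=1}^n |\alpha_i^*| \left| k(x_i, x) - \hat{k}(x_i, x) \right| \\
&\le \epsilon' \|\alpha^*\|_1 \\
&\le \epsilon' \Lambda ,
\end{aligned} \tag{4.3}$$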
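by the bound on $\|\alpha^*\|_1$. This and the Lipschitz continuity of the loss lead to

$$\begin{aligned}
\left| R^{\mathcal{H}}_{\mathrm{reg}}[f^*] - R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g_{\alpha^*}] \right|
&= \left| C R_{\mathrm{emp}}[f^*] - C R_{\mathrm{emp}}[g_{\alpha^*}] + \frac{1}{2} \|f^*\|_{\mathcal{H}}^2 - \frac{1}{2} \|g_{\alpha^*}\|_{\hat{\mathcal{H}}}^2 \right| \\
&\le \frac{C}{n} \sum_{i=1}^n \left| \ell(y_i, f^*(x_i)) - \ell(y_i, g_{\alpha^*}(x_i)) \right| + \frac{1}{2} \left| \alpha^{*\prime} \left( K - \hat{K} \right) \alpha^* \right| \\
&\le C L \left\| f^* - g_{\alpha^*} \right\|_{\infty; \mathcal{M}} + \frac{1}{2} \|\alpha^*\|_1 \left\| \left( K - \hat{K} \right) \alpha^* \right\|_\infty \\
&\le C L \left\| f^* - g_{\alpha^*} \right\|_{\infty; \mathcal{M}} + \frac{1}{2} \|\alpha^*\|_1^2 \, \epsilon' \\
&\le C L \epsilon' \Lambda + \Lambda^2 \epsilon'/2 \\
&= \left( C L + \frac{\Lambda}{2} \right) \Lambda \epsilon' .
\end{aligned}$$

Similarly, $\left| R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g^*] - R^{\mathcal{H}}_{\mathrm{reg}}[f_{\beta^*}] \right| \le (CL + \Lambda/2)\Lambda\epsilon'$ by the same argument. And since $R^{\mathcal{H}}_{\mathrm{reg}}[f_{\beta^*}] \ge R^{\mathcal{H}}_{\mathrm{reg}}[f^*]$ and
$R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g_{\alpha^*}] \ge R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g^*]$ we have proved that $R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g_{\alpha^*}] \le R^{\mathcal{H}}_{\mathrm{reg}}[f^*] + (CL + \Lambda/2)\Lambda\epsilon' \le R^{\mathcal{H}}_{\mathrm{reg}}[f_{\beta^*}] + (CL + \Lambda/2)\Lambda\epsilon' \le
R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g^*] + 2(CL + \Lambda/2)\Lambda\epsilon'$. And by implication (4.2),

$$\left\| g_{\alpha^*} - g^* \right\|_{\hat{\mathcal{H}}} \le 2 \sqrt{\left( C L + \frac{\Lambda}{2} \right) \Lambda \epsilon'} . \tag{4.4}$$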



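Now $\hat{k}(x, x) = 1$ for each $x \in \mathbb{R}^d$ implies

$$\begin{aligned}
\left| g_{\alpha^*}(x) - g^*(x) \right|
&= \left| \left\langle g_{\alpha^*} - g^*, \hat{k}(x, \cdot) \right\rangle_{\hat{\mathcal{H}}} \right| \\
&\le \left\| g_{\alpha^*} - g^* \right\|_{\hat{\mathcal{H}}} \sqrt{\hat{k}(x, x)} \\
&= \left\| g_{\alpha^*} - g^* \right\|_{\hat{\mathcal{H}}} .
\end{aligned}$$

This combines with Inequality (4.4) to yield

$$\left\| g_{\alpha^*} - g^* \right\|_{\infty; \mathcal{M}} \le 2 \sqrt{\left( C L + \frac{\Lambda}{2} \right) \Lambda \epsilon'} .$$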



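Together with Inequality (4.3) this finally implies that $\|f^* - g^*\|_{\infty; \mathcal{M}} \le \epsilon' \Lambda + 2\sqrt{(CL + \Lambda/2)\Lambda\epsilon'}$,
conditioned on event $A_{\epsilon'} = \left\{ \left\| \hat{k} - k \right\|_{\infty} \le \epsilon' \right\}$. For desired accuracy $\epsilon > 0$, conditioning on event $A_{\epsilon'}$ with

$$\epsilon' = \min\left\{ \frac{\epsilon}{2\left( \Lambda + 2\sqrt{(CL + \Lambda/2)\Lambda} \right)},\; \frac{\epsilon^2}{\left[ 2\left( \Lambda + 2\sqrt{(CL + \Lambda/2)\Lambda} \right) \right]^2} \right\}$$

yields bound $\|f^* - g^*\|_{\infty; \mathcal{M}} \le \epsilon/2$: if $\epsilon' \le 1$ then
$\epsilon/2 \ge \sqrt{\epsilon'} \left( \Lambda + 2\sqrt{(CL + \Lambda/2)\Lambda} \right) \ge \epsilon' \Lambda + 2\sqrt{(CL + \Lambda/2)\Lambda\epsilon'}$ provided that
$\epsilon' \le \epsilon^2 / \left[ 2\left( \Lambda + 2\sqrt{(CL + \Lambda/2)\Lambda} \right) \right]^2$. Otherwise if $\epsilon' > 1$ then we have
$\epsilon/2 \ge \epsilon' \left( \Lambda + 2\sqrt{(CL + \Lambda/2)\Lambda} \right) \ge \epsilon' \Lambda + 2\sqrt{(CL + \Lambda/2)\Lambda\epsilon'}$ provided
$\epsilon' \le \epsilon / \left[ 2\left( \Lambda + 2\sqrt{(CL + \Lambda/2)\Lambda} \right) \right]$. Since for any $H > 0$, $\min\{H, H^2\} \ge \min\{1, H^2\}$, the result follows. □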
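
The final step of the proof can be checked numerically. The sketch below assumes the bound $\epsilon'\Lambda + 2\sqrt{(CL + \Lambda/2)\Lambda\epsilon'}$ on the pointwise gap and the choice of $\epsilon'$ as the minimum of $H$ and $H^2$ with $H = \epsilon / [2(\Lambda + 2\sqrt{(CL + \Lambda/2)\Lambda})]$; the function names are hypothetical.

```python
import math

def kernel_tolerance(eps, Lam, C, L):
    """eps' = min{H, H^2} with H = eps / (2 (Lam + 2 sqrt((C L + Lam / 2) Lam)))."""
    K = Lam + 2 * math.sqrt((C * L + Lam / 2) * Lam)
    H = eps / (2 * K)
    return min(H, H * H)

def pointwise_gap_bound(eps_prime, Lam, C, L):
    """Bound on ||f* - g*||_inf conditioned on ||k_hat - k||_inf <= eps'."""
    return eps_prime * Lam + 2 * math.sqrt((C * L + Lam / 2) * Lam * eps_prime)

for eps, Lam, C, L in [(0.1, 1.0, 1.0, 1.0), (10.0, 0.1, 0.1, 1.0), (1.0, 5.0, 2.0, 0.5)]:
    ep = kernel_tolerance(eps, Lam, C, L)
    print(eps, ep, pointwise_gap_bound(ep, Lam, C, L), "<=", eps / 2)
```

The second parameter setting exercises the branch $\epsilon' > 1$, the first and third the branch $\epsilon' \le 1$.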



We now recall the result due to Rahimi and Recht (2008) that establishes the non-asymptotic uniform
convergence of the kernel functions required by the previous Lemma (i.e., a lower bound on the probability
of event $A_{\epsilon'}$).



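Lemma 13 (Rahimi and Recht 2008, Claim 1). For any $\epsilon > 0$, $\delta \in (0, 1)$, translation-invariant kernel $k$
and compact set $\mathcal{M} \subset \mathbb{R}^d$, if $\hat{d} \ge \frac{4(d+2)}{\epsilon^2} \log_e\left( \frac{2^8 (\sigma_p \mathrm{diam}(\mathcal{M}))^2}{\delta \epsilon^2} \right)$, then Algorithm 3's random feature mapping $\hat{\phi}$
defined in Equation (4.1) satisfies $\Pr\left( \left\| \hat{k} - k \right\|_{\infty; \mathcal{M}} < \epsilon \right) \ge 1 - \delta$, where $\sigma_p^2 = E\left[ \langle \rho, \rho \rangle \right]$ is the second moment
of the Fourier transform $p$ of $k$'s $g$ function.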
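
To get a feel for the sample size this requires, the following sketch evaluates the bound $\hat{d} \ge \frac{4(d+2)}{\epsilon^2}\log_e\left(\frac{2^8(\sigma_p\,\mathrm{diam}(\mathcal{M}))^2}{\delta\epsilon^2}\right)$ for given parameters. The function name and the example values are illustrative assumptions only.

```python
import math

def required_features(d, eps, delta, sigma_p, diam):
    """Smallest integer d_hat meeting the sufficient condition of Lemma 13."""
    bound = (4 * (d + 2) / eps ** 2) * math.log(
        2 ** 8 * (sigma_p * diam) ** 2 / (delta * eps ** 2)
    )
    return max(1, math.ceil(bound))

# Halving eps roughly quadruples the number of random features needed.
for eps in (0.5, 0.25, 0.125):
    print(eps, required_features(d=2, eps=eps, delta=0.1, sigma_p=1.0, diam=1.0))
```

The dependence on $\delta$ is only logarithmic, while the dependence on $\epsilon$ is quadratic up to a log factor.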

Combining these ingredients establishes utility for PrivateSVM. 

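Theorem 14 (Utility of PrivateSVM). Consider any database $D$, compact set $\mathcal{M} \subset \mathbb{R}^d$ containing $D$,
convex loss $\ell$, translation-invariant kernel $k$, and scalars $C, \epsilon > 0$ and $\delta \in (0, 1)$. Suppose the SVM with loss
$\ell$, kernel $k$ and parameter $C$ has dual variables with $L_1$-norm bounded by $\Lambda$. Then Algorithm 3 run on $D$ with
loss $\ell$, kernel $k$, parameters

$$\hat{d} \ge \frac{4(d+2)}{\theta(\epsilon)} \log_e\left( \frac{2^9 (\sigma_p \mathrm{diam}(\mathcal{M}))^2}{\delta\, \theta(\epsilon)} \right)
\quad \text{where} \quad
\theta(\epsilon) = \min\left\{ 1, \frac{\epsilon^4}{2^4 \left( \Lambda + 2\sqrt{(CL + \Lambda/2)\Lambda} \right)^4} \right\} ,$$

$$\lambda \le \min\left\{ \frac{\epsilon}{2^4 \log_e 2 \,\sqrt{\hat{d}}},\; \frac{\epsilon \sqrt{\hat{d}}}{8 \log_e \frac{2}{\delta}} \right\}$$

and $C$ is $(\epsilon, \delta)$-useful with respect to the SVM run on $D$ with loss $\ell$, kernel
$k$ and parameter $C$, wrt the $\|\cdot\|_{\infty; \mathcal{M}}$-norm.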

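Proof. Lemmas 12 and 10 combined via the triangle inequality, with Lemma 13, together establish the result
as follows. Define $A$ to be the conditioning event regarding the approximation of $k$ by $\hat{k}$, denote the events in
Lemmas 12 and 10 by $B$ and $C$ (beware we are overloading $C$ with the regularization parameter; its meaning
will be apparent from the context), and the target event in the theorem by $D$.

$$\begin{aligned}
A &= \left\{ \left\| \hat{k} - k \right\|_{\infty; \mathcal{M}} < \min\left\{ 1, \frac{\epsilon^2}{2^2 \left( \Lambda + 2\sqrt{(CL + \Lambda/2)\Lambda} \right)^2} \right\} \right\} \\
B &= \left\{ \left\| f^* - \tilde{f} \right\|_{\infty; \mathcal{M}} \le \epsilon/2 \right\} \\
C &= \left\{ \left\| \hat{f}^* - \tilde{f} \right\|_{\infty} \le \epsilon/2 \right\} \\
D &= \left\{ \left\| f^* - \hat{f}^* \right\|_{\infty; \mathcal{M}} \le \epsilon \right\}
\end{aligned}$$

The claim is a bound on $\Pr(D)$. By the triangle inequality events $B$ and $C$ together imply $D$. Second
note that event $C$ is independent of $A$ and $B$. Thus $\Pr(D \mid A) \ge \Pr(B \cap C \mid A) = \Pr(B \mid A) \Pr(C) \ge
1 \cdot (1 - \delta/2)$, for sufficiently small $\lambda$. Finally Lemma 13 bounds $\Pr(A)$ as follows: provided that $\hat{d} \ge
4(d+2) \log_e\left( 2^9 (\sigma_p \mathrm{diam}(\mathcal{M}))^2 / (\delta\, \theta(\epsilon)) \right) / \theta(\epsilon)$ where $\theta(\epsilon) = \min\left\{ 1, \epsilon^4 / \left[ 2\left( \Lambda + 2\sqrt{(CL + \Lambda/2)\Lambda} \right) \right]^4 \right\}$ we
have $\Pr(A) \ge 1 - \delta/2$. Together this yields $\Pr(D) \ge \Pr(D \mid A) \Pr(A) \ge (1 - \delta/2)^2 \ge 1 - \delta$. □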



Again we see that utility and privacy place competing constraints on the level of noise $\lambda$. Next we will
use these interactions to upper-bound the optimal differential privacy of the SVM.



5. Hinge-Loss and an Upper Bound on Optimal Differential Privacy 

We begin by 'plugging' hinge loss $\ell(y, \hat{y}) = (1 - y\hat{y})_+$ into the main results on privacy and utility of
the previous section (similar computations can be done for PrivateSVM-Finite and other convex loss
functions). The following is the dual formulation of hinge-loss SVM learning:

$$\max_{\alpha \in \mathbb{R}^n} \; \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j k(x_i, x_j) \tag{5.1}$$

$$\text{s.t.} \quad 0 \le \alpha_i \le \frac{C}{n} \quad \forall i \in [n] .$$

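Corollary 15. Consider any database $D$ of size $n$, scalar $C > 0$, and translation-invariant kernel $k$. For
any $\beta > 0$ and $\hat{d} \in \mathbb{N}$, PrivateSVM run on $D$ with hinge loss, noise parameter $\lambda \ge \frac{2^{2.5} C \sqrt{\hat{d}}}{\beta n}$, approximation
parameter $\hat{d}$, and regularization parameter $C$, guarantees $\beta$-differential privacy. Moreover for any compact
set $\mathcal{M} \subset \mathbb{R}^d$ containing $D$, and scalars $\epsilon > 0$ and $\delta \in (0, 1)$, PrivateSVM run on $D$ with hinge loss, kernel
$k$, noise parameter $\lambda \le \min\left\{ \frac{\epsilon}{2^4 \log_e 2 \,\sqrt{\hat{d}}}, \frac{\epsilon \sqrt{\hat{d}}}{8 \log_e \frac{2}{\delta}} \right\}$, approximation parameter $\hat{d} \ge \frac{4(d+2)}{\theta(\epsilon)} \log_e\left( \frac{2^9 (\sigma_p \mathrm{diam}(\mathcal{M}))^2}{\delta\, \theta(\epsilon)} \right)$
with $\theta(\epsilon) = \min\left\{ 1, \frac{\epsilon^4}{2^{12} C^4} \right\}$, and regularization parameter $C$, is $(\epsilon, \delta)$-useful wrt hinge-loss SVM run on $D$
with kernel $k$, and parameter $C$.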

Proof. The first result follows from Theorem 7 and the fact that hinge-loss is convex and 1-Lipschitz on $\mathbb{R}$:
i.e., $\partial_{\hat{y}} \ell = \mathbf{1}[1 \ge y\hat{y}] \le 1$. The second result follows almost immediately from Theorem 14. For hinge-loss
we have that feasible $\alpha_i$'s are bounded by $C/n$ (and so $\Lambda = C$) by the dual's box constraints and that $L = 1$,
implying we take $\theta(\epsilon) = \min\left\{ 1, \frac{\epsilon^4}{2^4 C^4 (1 + \sqrt{6})^4} \right\}$. This is bounded by the stated $\theta(\epsilon)$. □
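
The competing constraints of Corollary 15 can be made concrete with a small calculator. The sketch below assumes the hinge-loss quantities $\theta(\epsilon) = \min\{1, \epsilon^4/(2^{12}C^4)\}$, the lower bound on $\hat{d}$, the upper bound on $\lambda$ required for utility, and the privacy level $\beta = 2^{2.5}C\sqrt{\hat{d}}/(\lambda n)$ obtained by making the privacy constraint tight; the function names and example values are hypothetical.

```python
import math

def theta(eps, C):
    """theta(eps) for hinge loss, as stated in Corollary 15."""
    return min(1.0, eps ** 4 / (2 ** 12 * C ** 4))

def hinge_parameters(eps, delta, C, n, d, sigma_p, diam):
    """Return (d_hat, lam, beta): smallest admissible d_hat, largest noise
    scale allowed by utility, and the privacy level that noise scale buys."""
    th = theta(eps, C)
    d_hat = math.ceil(
        4 * (d + 2) / th * math.log(2 ** 9 * (sigma_p * diam) ** 2 / (delta * th))
    )
    lam = min(
        eps / (2 ** 4 * math.log(2) * math.sqrt(d_hat)),
        eps * math.sqrt(d_hat) / (8 * math.log(2 / delta)),
    )
    beta = 2 ** 2.5 * C * math.sqrt(d_hat) / (lam * n)
    return d_hat, lam, beta

for n in (10 ** 6, 10 ** 7, 10 ** 8):
    print(n, hinge_parameters(eps=0.5, delta=0.1, C=1.0, n=n, d=2, sigma_p=1.0, diam=1.0))
```

With everything else fixed, the attainable $\beta$ decays like $1/n$, while tightening $\epsilon$ inflates both $\hat{d}$ and $\beta$.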

Combining the competing requirements on noise level $\lambda$ upper-bounds optimal differential privacy of
hinge-loss SVM.

Theorem 16. The optimal differential privacy for hinge-loss SVM learning on translation-invariant kernel 



k is bounded by (3(e, S, C, n, l,k) = (j^ y^log }- t (log \ + log 



2 J_ 

5e 



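Proof. Consider hinge loss in Corollary 15. Privacy places a lower bound of $\beta \ge 2^{2.5} C \sqrt{\hat{d}} / (\lambda n)$ for any
chosen $\lambda$, which we can convert to a lower bound on $\beta$ in terms of $\epsilon$ and $\delta$ as follows. For small $\epsilon$, we have
$\theta(\epsilon) = \epsilon^4 2^{-12} C^{-4}$ and so to achieve $(\epsilon, \delta)$-usefulness we must take $\hat{d} = O\left( \frac{1}{\epsilon^4} \log_e\left( \frac{1}{\delta\epsilon} \right) \right)$. There are two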

cases for utility, if A = e/ (2 4 log e (^Vdj) then p = O f ^^ VA = q ^/log^ (log i + log 2 A 



St . 



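Otherwise we are in the second case, with $\lambda = \frac{\epsilon \sqrt{\hat{d}}}{8 \log_e \frac{2}{\delta}}$ yielding $\beta = O\left( \frac{1}{\epsilon n} \log \frac{1}{\delta} \right)$ which is dominated by the
first case as $\epsilon \downarrow 0$. □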

A natural question arises from this discussion: given any mechanism that is $(\epsilon, \delta)$-useful with respect to
hinge SVM, for how small a $\beta$ can we possibly hope to guarantee $\beta$-differential privacy? In other words,
what lower bounds exist for the optimal differential privacy for the SVM?



6. Lower Bounding Optimal Differential Privacy 

To lower bound $\beta$ for any $(\epsilon, \delta)$-useful mechanism, we first establish a negative sensitivity result for the
SVM, by constructing two neighboring databases on which SVM classifiers differ.

Lemma 17. For any $C > 0$, $n > 1$ and $0 < \epsilon < \frac{\sqrt{C}}{2n}$ there exists a pair of neighboring databases $D_1, D_2$
on $n$ entries, such that the functions $f_1^*, f_2^*$ parametrized by SVM run with parameter $C$, linear kernel, and
hinge loss on $D_1, D_2$ respectively, satisfy $\|f_1^* - f_2^*\|_\infty \ge 2\epsilon$.

Proof. We construct the two databases on the line as follows. Let $0 < m < M$ be scalars to be chosen later.
Both databases share negative examples $x_1 = \ldots = x_{\lfloor n/2 \rfloor} = -M$ and positive examples $x_{\lfloor n/2 \rfloor + 1} = \ldots =
x_{n-1} = M$. Each database has $x_n = M - m$, with $y_n = -1$ for $D_1$ and $y_n = 1$ for $D_2$. In what follows we
use subscripts to denote an example's parent database, so $(x_{i,j}, y_{i,j})$ is the $j$th example from $D_i$. Consider
the result of running primal SVM on each database

$$w_1^* = \arg\min_{w \in \mathbb{R}} \frac{1}{2} w^2 + \frac{C}{n} \sum_{i=1}^n \left( 1 - y_{1,i} w x_{1,i} \right)_+ ,$$

$$w_2^* = \arg\min_{w \in \mathbb{R}} \frac{1}{2} w^2 + \frac{C}{n} \sum_{i=1}^n \left( 1 - y_{2,i} w x_{2,i} \right)_+ .$$

Each optimization is strictly convex and unconstrained, so the optimizing $w_1^*, w_2^*$ are characterized by the
first-order KKT conditions $0 \in \partial_w f_i(w)$ for $f_i$ being the objective function for learning on $D_i$, and $\partial_w$
denoting the subdifferential operator. Now for each $i \in [2]$

$$\partial_w f_i(w) = w - \frac{C}{n} \sum_{j=1}^n y_{i,j} x_{i,j} \tilde{\mathbf{1}}\left[ 1 - y_{i,j} w x_{i,j} \right] ,$$

where

$$\tilde{\mathbf{1}}[x] = \begin{cases} \{0\} , & \text{if } x < 0 \\ [0, 1] , & \text{if } x = 0 \\ \{1\} , & \text{if } x > 0 \end{cases}$$

Figure 6.1. For each $i \in [2]$, the SVM's primal solution $w_i^*$ on database $D_i$ constructed in
the proof of Lemma 17 corresponds to the crossing point of line $y = w$ with $y = w - \partial_w f_i(w)$.
Database $D_1$ is shown on the left, database $D_2$ is shown on the right.



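is the subdifferential of $(x)_+$. Thus for each $i \in [2]$, $w_i^* \in \frac{C}{n} \sum_{j=1}^n y_{i,j} x_{i,j} \tilde{\mathbf{1}}\left[ 1 - y_{i,j} w_i^* x_{i,j} \right]$, which is equivalent
to

$$w_1^* \in \frac{C M (n-1)}{n} \tilde{\mathbf{1}}\left[ \frac{1}{M} - w_1^* \right] + \frac{C (m - M)}{n} \tilde{\mathbf{1}}\left[ w_1^* - \frac{1}{m - M} \right] ,$$

$$w_2^* \in \frac{C M (n-1)}{n} \tilde{\mathbf{1}}\left[ \frac{1}{M} - w_2^* \right] + \frac{C (M - m)}{n} \tilde{\mathbf{1}}\left[ \frac{1}{M - m} - w_2^* \right] .$$

The RHSs of these conditions correspond to decreasing piecewise-constant functions, and the conditions are
met when the corresponding functions intersect with the diagonal $y = x$ line, as shown in Figure 6.1. If
$\frac{C(M(n-2) + m)}{n} < \frac{1}{M}$ then $w_1^* = \frac{C(M(n-2) + m)}{n}$, and if $\frac{C(Mn - m)}{n} < \frac{1}{M}$ then $w_2^* = \frac{C(Mn - m)}{n}$. So provided that
$\frac{1}{M} > \frac{C(Mn - m)}{n} = \max\left\{ \frac{C(M(n-2) + m)}{n}, \frac{C(Mn - m)}{n} \right\}$, we have $|w_1^* - w_2^*| = \frac{2C}{n} |M - m|$. So taking $M = \frac{2n\epsilon}{C}$
and $m = \frac{n\epsilon}{C}$, this implies

$$\|f_1^* - f_2^*\|_\infty \ge |f_1^*(1) - f_2^*(1)| = 2\epsilon ,$$

provided $\epsilon < \frac{\sqrt{C}}{2n}$. □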
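
The construction in this proof is easy to verify numerically. The sketch below builds the two neighboring one-dimensional databases, minimizes each primal hinge-loss objective by ternary search (valid because the objective is strictly convex), and compares the gap between the two solutions with $2\epsilon$. The specific choice $M = 2n\epsilon/C$, $m = n\epsilon/C$ is an assumption consistent with the requirement $\frac{2C}{n}(M - m) = 2\epsilon$; all function names are hypothetical.

```python
def hinge_objective(w, data, C):
    """Primal objective (1/2) w^2 + (C/n) * sum of hinge losses for a 1-D linear SVM."""
    n = len(data)
    return 0.5 * w * w + (C / n) * sum(max(0.0, 1.0 - y * w * x) for x, y in data)

def minimize_convex(f, lo=-10.0, hi=10.0, iters=200):
    """Ternary search for the minimizer of a strictly convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def neighboring_databases(n, C, eps):
    """Databases D1, D2 that differ only in the label of the last example."""
    M = 2.0 * n * eps / C  # assumed choice of M
    m = n * eps / C        # assumed choice of m
    shared = [(-M, -1)] * (n // 2) + [(M, 1)] * (n - 1 - n // 2)
    return shared + [(M - m, -1)], shared + [(M - m, 1)]

n, C, eps = 10, 1.0, 0.04  # eps < sqrt(C) / (2 n) = 0.05
D1, D2 = neighboring_databases(n, C, eps)
w1 = minimize_convex(lambda w: hinge_objective(w, D1, C))
w2 = minimize_convex(lambda w: hinge_objective(w, D2, C))
print(w1, w2, abs(w1 - w2), 2 * eps)
```

Since $f_i(x) = w_i x$ for the linear kernel, the gap $|w_1^* - w_2^*|$ is exactly $|f_1^*(1) - f_2^*(1)|$.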



Theorem 18 (Lower bound on optimal differential privacy for hinge loss SVM). For any $C > 0$, $n > 1$,
$\delta \in (0, 1)$ and $\epsilon \in \left( 0, \frac{\sqrt{C}}{2n} \right)$, the optimal differential privacy for the hinge-loss SVM with linear kernel is
lower-bounded by $\log_e \frac{1 - \delta}{\delta}$. In other words, for any $C, \beta > 0$ and $n > 1$, if a mechanism $\hat{M}$ is $(\epsilon, \delta)$-useful
and $\beta$-differentially private then either $\epsilon \ge \frac{\sqrt{C}}{2n}$ or $\delta \ge \exp(-\beta)$.

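Proof. Consider $(\epsilon, \delta)$-useful mechanism $\hat{M}$ with respect to SVM learning mechanism $M$ with parameter
$C > 0$, hinge loss and linear kernel on $n$ training examples, where $\delta > 0$ and $\frac{\sqrt{C}}{2n} > \epsilon > 0$. By Lemma 17
there exists a pair of neighboring databases $D_1, D_2$ on $n$ entries, such that $\|f_1^* - f_2^*\|_\infty \ge 2\epsilon$ where $f_i^* = f_{M(D_i)}$
for each $i \in [2]$. Let $\hat{f}_i = f_{\hat{M}(D_i)}$ for each $i \in [2]$. Then by the utility of $\hat{M}$,

$$\Pr\left( \hat{f}_1 \in B_\epsilon^\infty(f_1^*) \right) \ge 1 - \delta , \tag{6.1}$$

$$\Pr\left( \hat{f}_2 \in B_\epsilon^\infty(f_1^*) \right) \le \Pr\left( \hat{f}_2 \notin B_\epsilon^\infty(f_2^*) \right) \le \delta . \tag{6.2}$$

Let $\hat{P}_1$ and $\hat{P}_2$ be the distributions of $\hat{M}(D_1)$ and $\hat{M}(D_2)$ respectively so that $\hat{P}_i(t) = \Pr\left( \hat{M}(D_i) = t \right)$.
Then by Inequalities (6.1) and (6.2)

$$E_{T \sim \hat{P}_1}\left[ \frac{d\hat{P}_2(T)}{d\hat{P}_1(T)} \,\middle|\, T \in B_\epsilon^\infty(f_1^*) \right]
= \frac{\int_{B_\epsilon^\infty(f_1^*)} \frac{d\hat{P}_2(t)}{d\hat{P}_1(t)} \, d\hat{P}_1(t)}{\int_{B_\epsilon^\infty(f_1^*)} d\hat{P}_1(t)}
\le \frac{\delta}{1 - \delta} .$$

Thus there exists a $t$ such that $\log \frac{\Pr(\hat{M}(D_1) = t)}{\Pr(\hat{M}(D_2) = t)} \ge \log \frac{1 - \delta}{\delta}$. □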


The same technique can be extended to prove a stronger lower bound. First we construct a set of $N > 1$
neighboring databases having SVM images that are a $2\epsilon$-packing. To achieve this for any $N$ we move from
linear to RBF kernel.



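Lemma 19. For any $C > 0$, $n > C$, $0 < \epsilon < \frac{C}{4n}$, and $0 < \sigma < \sqrt{\frac{1}{2 \log_e 2}}$ there exists a set of $N = \left\lfloor \frac{2}{\sigma} \sqrt{\frac{2}{\log_e 2}} \right\rfloor$
pairwise-neighboring databases $\{D_i\}_{i=1}^N$ on $n$ examples, such that the functions $f_i^*$ parametrized by hinge-loss
SVM run on $D_i$ with parameter $C$ and RBF kernel with parameter $\sigma$, satisfy $\|f_i^* - f_j^*\|_\infty \ge 2\epsilon$ for each
$i \ne j$.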

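Proof. Construct $N > 1$ pairwise neighboring databases each on $n$ examples in $\mathbb{R}^2$ as follows. Each database
$D_i$ has $n - 1$ negative examples $x_{i,1} = \ldots = x_{i,n-1} = 0$, and database $D_i$ has positive example $x_{i,n} =
(\cos\theta_i, \sin\theta_i)$ where $\theta_i = \frac{2\pi i}{N}$. Consider the result of running SVM with hinge loss and RBF kernel on each
$D_i$. For each database $k(x_{i,s}, x_{i,t}) = 1$ and $k(x_{i,s}, x_{i,n}) = \exp\left( -\frac{1}{2\sigma^2} \right) =: \gamma$ for all $s, t \in [n-1]$. Notice that
the range space of $\gamma$ is $(0, 1)$. Since the inner-products and labels are database-independent, the SVM dual
variables are also database-independent. Each involves solving

$$\max_{\alpha \in \mathbb{R}^n} \; \alpha' \mathbf{1} - \frac{1}{2} \alpha' \begin{pmatrix} \mathbf{1}\mathbf{1}' & -\gamma \mathbf{1} \\ -\gamma \mathbf{1}' & 1 \end{pmatrix} \alpha
\quad \text{s.t.} \quad 0 \le \alpha \le \frac{C}{n} \mathbf{1} .$$

By symmetry $\alpha_1^* = \ldots = \alpha_{n-1}^*$, so we can reduce this to the equivalent program on two variables:

$$\max_{\alpha \in \mathbb{R}^2} \; \alpha' \begin{pmatrix} n-1 \\ 1 \end{pmatrix} - \frac{1}{2} \alpha' \begin{pmatrix} (n-1)^2 & -\gamma(n-1) \\ -\gamma(n-1) & 1 \end{pmatrix} \alpha
\quad \text{s.t.} \quad 0 \le \alpha \le \frac{C}{n} \mathbf{1} .$$

Consider first the unconstrained program. In this case the necessary first-order KKT condition is that

$$0 = \begin{pmatrix} n-1 \\ 1 \end{pmatrix} - \begin{pmatrix} (n-1)^2 & -\gamma(n-1) \\ -\gamma(n-1) & 1 \end{pmatrix} \alpha^* .$$

This implies

$$\begin{aligned}
\alpha^* &= \begin{pmatrix} (n-1)^2 & -\gamma(n-1) \\ -\gamma(n-1) & 1 \end{pmatrix}^{-1} \begin{pmatrix} n-1 \\ 1 \end{pmatrix} \\
&= \frac{1}{(n-1)^2 (1 - \gamma^2)} \begin{pmatrix} 1 & \gamma(n-1) \\ \gamma(n-1) & (n-1)^2 \end{pmatrix} \begin{pmatrix} n-1 \\ 1 \end{pmatrix} \\
&= \frac{1}{(n-1)^2 (1 - \gamma)(1 + \gamma)} \begin{pmatrix} (n-1)(1 + \gamma) \\ (n-1)^2 (1 + \gamma) \end{pmatrix} \\
&= \begin{pmatrix} \frac{1}{(n-1)(1 - \gamma)} \\ \frac{1}{1 - \gamma} \end{pmatrix} .
\end{aligned}$$

Since this solution is strictly positive, it follows that at most two (upper) constraints can be active. Thus four
cases are possible: the solution lies in the interior of the feasible set, or one or both upper box-constraints
hold with equality. Noting that $\frac{1}{(n-1)(1 - \gamma)} \le \frac{1}{1 - \gamma}$ it follows that $\alpha^*$ is feasible iff $\frac{1}{1 - \gamma} \le \frac{C}{n}$. This is equivalent
to $C \ge \frac{n}{1 - \gamma} > n$, since $\gamma \in (0, 1)$. This corresponds to under-regularization.

If both constraints hold with equality we have $\alpha^* = \frac{C}{n} \mathbf{1}$, which is always feasible.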
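
In the case where the first constraint holds with equality, $\alpha_1^* = \frac{C}{n}$, the second dual variable is found by
optimizing

$$\begin{aligned}
\alpha_2^* &= \arg\max_{\alpha_2 \in \mathbb{R}} \; \alpha' \begin{pmatrix} n-1 \\ 1 \end{pmatrix} - \frac{1}{2} \alpha' \begin{pmatrix} (n-1)^2 & -\gamma(n-1) \\ -\gamma(n-1) & 1 \end{pmatrix} \alpha \\
&= \arg\max_{\alpha_2 \in \mathbb{R}} \; \frac{C(n-1)}{n} + \alpha_2 - \frac{1}{2} \left( \left( \frac{C(n-1)}{n} \right)^2 - 2 \frac{C\gamma(n-1)}{n} \alpha_2 + \alpha_2^2 \right) \\
&= \arg\max_{\alpha_2 \in \mathbb{R}} \; -\frac{1}{2} \alpha_2^2 + \alpha_2 \left( 1 + \frac{C\gamma(n-1)}{n} \right) ,
\end{aligned}$$

implying $\alpha_2^* = 1 + C\gamma\frac{n-1}{n}$. This solution is feasible provided $1 + C\gamma\frac{n-1}{n} \le \frac{C}{n}$ iff $n \le \frac{C(1 + \gamma)}{1 + C\gamma}$. Again this
corresponds to under-regularization.

Finally in the case where the second constraint holds with equality, $\alpha_2^* = \frac{C}{n}$, the first dual is found by
optimizing

$$\begin{aligned}
\alpha_1^* &= \arg\max_{\alpha_1 \in \mathbb{R}} \; \alpha' \begin{pmatrix} n-1 \\ 1 \end{pmatrix} - \frac{1}{2} \alpha' \begin{pmatrix} (n-1)^2 & -\gamma(n-1) \\ -\gamma(n-1) & 1 \end{pmatrix} \alpha \\
&= \arg\max_{\alpha_1 \in \mathbb{R}} \; (n-1)\alpha_1 + \frac{C}{n} - \frac{1}{2} \left( (n-1)^2 \alpha_1^2 - 2 \frac{C\gamma(n-1)}{n} \alpha_1 + \frac{C^2}{n^2} \right) \\
&= \arg\max_{\alpha_1 \in \mathbb{R}} \; -\frac{1}{2} (n-1)^2 \alpha_1^2 + \alpha_1 (n-1) \left( 1 + \frac{C\gamma}{n} \right) ,
\end{aligned}$$

implying $\alpha_1^* = \frac{1 + C\gamma/n}{n-1}$. This is feasible provided $\frac{1 + C\gamma/n}{n-1} \le \frac{C}{n}$. Passing back to the program on $n$ variables,
by the invariance of the duals to the database, for any pair $D_i, D_j$

$$\begin{aligned}
\left| f_i(x_{i,n}) - f_j(x_{i,n}) \right| &= \alpha_n^* \left( 1 - k(x_{i,n}, x_{j,n}) \right) \\
&\ge \alpha_n^* \left( 1 - \max_{q \ne i} k(x_{i,n}, x_{q,n}) \right) .
\end{aligned}$$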

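Now a simple argument shows that this maximum is equal to $\gamma^{4\sin^2(\pi/N)}$ for all $i$. The maximum objective
is optimized when $|q - i| = 1$. In this case $|\theta_i - \theta_q| = \frac{2\pi}{N}$. The norm $\|x_{i,n} - x_{q,n}\| = 2\sin\frac{|\theta_i - \theta_q|}{2} = 2\sin\frac{\pi}{N}$ by
basic geometry. Thus $k(x_{i,n}, x_{q,n}) = \exp\left( -\frac{\|x_{i,n} - x_{q,n}\|^2}{2\sigma^2} \right) = \exp\left( -\frac{2}{\sigma^2}\sin^2\frac{\pi}{N} \right) = \gamma^{4\sin^2(\pi/N)}$ as claimed.
Notice that $N \ge 2$, so the exponent $4\sin^2(\pi/N)$ lies in $(0, 4]$, while $\gamma$ is in $(0, 1)$. In summary we have shown
that for any $i \ne j$

$$\left| f_i(x_{i,n}) - f_j(x_{i,n}) \right| \ge \alpha_n^* \left( 1 - \exp\left( -\frac{2}{\sigma^2}\sin^2\frac{\pi}{N} \right) \right) .$$

Assume $\gamma < \frac{1}{2}$. If $n > C$ then $n > C > (1 - \gamma)C$, which implies case 1 is infeasible. Similarly since
$C\gamma\frac{n-1}{n} > 0$, $n > C$ implies $1 + C\gamma\frac{n-1}{n} > 1 > \frac{C}{n}$, which implies case 3 is infeasible. Thus provided that $\gamma < \frac{1}{2}$
and $n > C$ we have that either case 2 or case 4 must hold. In both cases $\alpha_n^* = \frac{C}{n}$, giving

$$\left| f_i(x_{i,n}) - f_j(x_{i,n}) \right| \ge \frac{C}{n} \left( 1 - \exp\left( -\frac{2}{\sigma^2}\sin^2\frac{\pi}{N} \right) \right) .$$

Provided that $\sigma \le \sqrt{\frac{2}{\log_e 2}}\sin\frac{\pi}{N}$ we have $\left( 1 - \exp\left( -\frac{2}{\sigma^2}\sin^2\frac{\pi}{N} \right) \right)\frac{C}{n} \ge \left( 1 - \frac{1}{2} \right)\frac{C}{n} = \frac{C}{2n}$. Now for small $x$ we
can take the linear approximation $\sin x \ge \frac{x}{\pi/2}$ for $x \in [0, \pi/2]$. If $N \ge 2$ then $\sin\frac{\pi}{N} \ge \frac{2}{N}$. Thus in this case
we can take $\sigma \le \sqrt{\frac{2}{\log_e 2}}\,\frac{2}{N}$ to imply $\left| f_i(x_{i,n}) - f_j(x_{i,n}) \right| \ge \frac{C}{2n}$. This bound on $\sigma$ in turn implies the following
bound on $\gamma$: $\gamma = \exp\left( -\frac{1}{2\sigma^2} \right) \le \exp\left( -\frac{N^2 \log_e 2}{16} \right)$. Thus taking $N \ge 4$, in conjunction with $\sigma \le \sqrt{\frac{2}{\log_e 2}}\,\frac{2}{N}$,
implies $\gamma \le \frac{1}{2}$. Rather than selecting $N$ which bounds $\sigma$, we can choose $N$ in terms of $\sigma$: $\sigma \le \sqrt{\frac{2}{\log_e 2}}\,\frac{2}{N}$ is
implied by $N = \left\lfloor \frac{2}{\sigma}\sqrt{\frac{2}{\log_e 2}} \right\rfloor$. So for small $\sigma$ we can construct more databases leading to the desired separation.
Finally, $N \ge 4$ implies that we must constrain $\sigma \le \sqrt{\frac{1}{2\log_e 2}}$.

In summary, if $n > C$ and $\sigma \le \sqrt{\frac{1}{2\log_e 2}}$ then $\left| f_i(x_{i,n}) - f_j(x_{i,n}) \right| \ge \frac{C}{2n}$ for each $i \ne j \in [N]$ where
$N = \left\lfloor \frac{2}{\sigma}\sqrt{\frac{2}{\log_e 2}} \right\rfloor$. Moreover if $\epsilon \le \frac{C}{4n}$ then for any $i \ne j$ this implies $\|f_i - f_j\|_\infty \ge 2\epsilon$ as claimed. □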
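
The geometric core of this proof can be checked directly. The sketch below assumes the RBF kernel $\exp(-\|x-y\|^2/(2\sigma^2))$, places $N = \lfloor \frac{2}{\sigma}\sqrt{2/\log_e 2} \rfloor$ points evenly on the unit circle, and confirms that $1 - k(x_{i,n}, x_{j,n}) \ge \frac{1}{2}$ for every pair, which is what yields the separation $\frac{C}{2n}$ once $\alpha_n^* = \frac{C}{n}$. Function names are hypothetical.

```python
import math

def num_databases(sigma):
    """N = floor((2 / sigma) * sqrt(2 / ln 2)), the number of databases in the packing."""
    return math.floor((2.0 / sigma) * math.sqrt(2.0 / math.log(2.0)))

def rbf(x, y, sigma):
    """RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def min_separation_factor(sigma):
    """Minimum over pairs i != j of 1 - k(x_i, x_j) for the circle construction."""
    N = num_databases(sigma)
    pts = [(math.cos(2 * math.pi * i / N), math.sin(2 * math.pi * i / N)) for i in range(N)]
    return min(
        1.0 - rbf(pts[i], pts[j], sigma)
        for i in range(N)
        for j in range(i + 1, N)
    )

for sigma in (0.8, 0.5, 0.2, 0.1, 0.05):
    print(sigma, num_databases(sigma), min_separation_factor(sigma))
```

Smaller $\sigma$ permits more databases at the same separation, which is what drives the stronger lower bound that follows.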

Theorem 20 (Strong lower bound on optimal differential privacy for hinge loss). For $C > 0$, $n > C$,
$\delta \in (0, 1)$, $\epsilon \in \left( 0, \frac{C}{4n} \right)$, and $0 < \sigma < \sqrt{\frac{1}{2\log_e 2}}$, the optimal differential privacy for the hinge SVM with RBF
kernel having parameter $\sigma$ is lower-bounded by $\log_e \frac{(1 - \delta)(N - 1)}{\delta}$, where $N = \left\lfloor \frac{2}{\sigma}\sqrt{\frac{2}{\log_e 2}} \right\rfloor$. That is, under these
conditions, all mechanisms that are $(\epsilon, \delta)$-useful wrt hinge SVM with RBF kernel for any $\sigma$ do not achieve
differential privacy at any level.

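Proof. Consider $(\epsilon, \delta)$-useful mechanism $\hat{M}$ with respect to hinge SVM learning mechanism $M$ with parameter
$C > 0$ and RBF kernel with parameter $0 < \sigma < \sqrt{\frac{1}{2\log_e 2}}$ on $n$ training examples, where $\delta > 0$ and
$\frac{C}{4n} > \epsilon > 0$. Let $N = \left\lfloor \frac{2}{\sigma}\sqrt{\frac{2}{\log_e 2}} \right\rfloor \ge 4$. By Lemma 19 there exist pairwise neighboring databases $D_1, \ldots, D_N$
of $n$ entries, such that $\{f_i^*\}_{i=1}^N$ is an $\epsilon$-packing wrt the $L_\infty$-norm, where $f_i^* = f_{M(D_i)}$. So by the utility of
$\hat{M}$, for each $i \in [N]$

$$\Pr\left( \hat{f}_i \in B_\epsilon^\infty(f_i^*) \right) \ge 1 - \delta , \tag{6.3}$$

$$\sum_{j \ne 1} \Pr\left( \hat{f}_1 \in B_\epsilon^\infty(f_j^*) \right) \le \Pr\left( \hat{f}_1 \notin B_\epsilon^\infty(f_1^*) \right) \le \delta ,$$

$$\Longrightarrow \; \exists j \ne 1, \quad \Pr\left( \hat{f}_1 \in B_\epsilon^\infty(f_j^*) \right) \le \frac{\delta}{N - 1} . \tag{6.4}$$

Let $\hat{P}_1$ and $\hat{P}_j$ be the distributions of $\hat{M}(D_1)$ and $\hat{M}(D_j)$ respectively so that for each, $\hat{P}_i(t) = \Pr\left( \hat{M}(D_i) = t \right)$.
Then by Inequalities (6.3) and (6.4)

$$E_{T \sim \hat{P}_j}\left[ \frac{d\hat{P}_1(T)}{d\hat{P}_j(T)} \,\middle|\, T \in B_\epsilon^\infty(f_j^*) \right]
= \frac{\int_{B_\epsilon^\infty(f_j^*)} \frac{d\hat{P}_1(t)}{d\hat{P}_j(t)} \, d\hat{P}_j(t)}{\int_{B_\epsilon^\infty(f_j^*)} d\hat{P}_j(t)}
\le \frac{\delta}{(1 - \delta)(N - 1)} .$$

Thus there exists a $t$ such that $\log \frac{\Pr(\hat{M}(D_j) = t)}{\Pr(\hat{M}(D_1) = t)} \ge \log \frac{(1 - \delta)(N - 1)}{\delta}$. □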
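
The two lower bounds are simple to tabulate. The sketch below evaluates $\log_e\frac{1-\delta}{\delta}$ (linear kernel) and $\log_e\frac{(1-\delta)(N-1)}{\delta}$ with $N = \lfloor \frac{2}{\sigma}\sqrt{2/\log_e 2} \rfloor$ (RBF kernel), taking these expressions as assumptions read off the two theorems; the function names are hypothetical.

```python
import math

def linear_lower_bound(delta):
    """Lower bound on beta for the linear-kernel hinge SVM."""
    return math.log((1.0 - delta) / delta)

def rbf_lower_bound(delta, sigma):
    """Lower bound on beta for the RBF-kernel hinge SVM with parameter sigma."""
    N = math.floor((2.0 / sigma) * math.sqrt(2.0 / math.log(2.0)))
    return math.log((1.0 - delta) * (N - 1) / delta)

delta = 0.1
print("linear:", linear_lower_bound(delta))
for sigma in (0.8, 0.1, 0.01):
    print("rbf, sigma =", sigma, rbf_lower_bound(delta, sigma))
```

The RBF bound exceeds the linear one whenever $N > 2$ and grows without limit as $\sigma$ shrinks.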


Note that $n > C$ is a weak condition, since $C$ should grow like $\sqrt{n}$ for universal consistency. Also note
that this negative result is consistent with our upper bound on optimal differential privacy: $\sigma$ affects $\sigma_p$,
increasing the upper bounds as $\sigma \downarrow 0$.

7. Conclusion & Open Problems 

We have presented a pair of new mechanisms for private SVM learning. In each case we have established 
differential privacy via the algorithmic stability of regularized empirical risk minimization. To achieve utility 
under infinite-dimensional feature mappings, we perform regularized ERM in a random Reproducing Kernel 
Hilbert Space whose kernel approximates the target RKHS kernel. This trick, borrowed from large-scale 
learning, permits the mechanism to privately respond with a finite representation of a maximum-margin 
hyperplane classifier. We then established the high-probability, pointwise similarity between the resulting 
function and the SVM classifier through a new smoothness result of regularized ERM with respect to
perturbations of the RKHS. The bounds on differential privacy and utility combine to upper bound the optimal
differential privacy of SVM learning for hinge-loss. This quantity is the optimal level of privacy among all
mechanisms that are $(\epsilon, \delta)$-useful with respect to the hinge-loss SVM. Finally, we derived a lower bound on
this quantity which established that any mechanism that is too accurate with respect to the hinge SVM
with RBF kernel, with any non-trivial probability, cannot be $\beta$-differentially private for small $\beta$. The lower
bounds explicitly depend on the variance of the RBF kernel.

An interesting open problem is to derive lower bounds holding for moderate to large e. Another direction 
for future research is to extend our mechanisms to other kernel methods. Finally, a general connection 
between algorithmic stability and global sensitivity would immediately suggest a number of practical privacy- 
preserving learning mechanisms. 

Appendix A. Proofs for Subdifferentiable Loss Functions

The main results were stated in terms of non-differentiable convex loss functions so that they would hold
for the hinge loss; however, the proofs in the main text applied to differentiable loss functions only. For
completeness we now re-prove the appropriate lemmas for subdifferentiable loss functions, of which general
convex loss functions are a special case.

In each case the proofs for subdifferentiable loss are essentially identical to the differentiable loss proofs:
we discuss only the arguments that change when generalizing to non-differentiable loss functions. Previously
$\partial_w$ referred to the gradient operator; now it refers to the subdifferential operator. The subscript reminds
us that we are viewing the operand as a function of $w$ only. Similarly other subscripts extend the notion
of other partial derivatives. Previously there was a unique gradient at each point; now there may be many
subgradients, making up the subdifferential set at a point. As we are dealing with sets of subgradients,
we use the following shorthand for sets $S, T$, vector $v$ and scalar $a$: $S + T = \{g + h \mid g \in S, h \in T\}$, $aS =
\{ag \mid g \in S\}$, $S + v = \{g + v \mid g \in S\}$, $\langle S, v \rangle = \{\langle g, v \rangle \mid g \in S\}$, and $S \ge T$ means $g \ge h$ for all $(g, h) \in S \times T$.

Lemma 21 generalizes Lemma 6 on the sensitivity of the SVM primal weight vector, to general (i.e.,
subdifferentiable) convex loss functions.

Lemma 21. Consider loss function $\ell(y, \hat{y})$ that is convex and $L$-Lipschitz in $\hat{y}$, and RKHS $\mathcal{H}$ induced by
finite $F$-dimensional feature mapping $\phi$ with bounded kernel $k(x, x) \le \kappa^2$ for all $x \in \mathbb{R}^d$. Let $w_S \in \mathbb{R}^F$ be
the minimizer of the following regularized empirical risk function for each database $S = \{(x_i, y_i)\}_{i=1}^n$

$$R_{\mathrm{reg}}(w, S) = \frac{C}{n} \sum_{i=1}^n \ell\left( y_i, f_w(x_i) \right) + \frac{1}{2} \|w\|_2^2 .$$

Then for every pair of neighboring databases $D, D'$ of $n$ entries, $\|w_D - w_{D'}\|_1 \le 4LC\kappa\sqrt{F}/n$.

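Proof. For convenience we define $R_{\mathrm{emp}}(w, S) = n^{-1} \sum_{i=1}^n \ell(y_i, f_w(x_i))$ for any training set $S$; then the first-order
necessary KKT conditions imply

$$0 \in \partial_w R_{\mathrm{reg}}(w_D, D) = C\, \partial_w R_{\mathrm{emp}}(w_D, D) + w_D , \tag{A.1}$$

$$0 \in \partial_w R_{\mathrm{reg}}(w_{D'}, D') = C\, \partial_w R_{\mathrm{emp}}(w_{D'}, D') + w_{D'} . \tag{A.2}$$

Define the auxiliary risk function

$$\tilde{R}(w) = C \left\langle \partial_w R_{\mathrm{emp}}(w_D, D) - \partial_w R_{\mathrm{emp}}(w_{D'}, D'), w - w_{D'} \right\rangle + \frac{1}{2} \|w - w_{D'}\|_2^2 .$$

It is easy to see that $\tilde{R}(w)$ is strictly convex in $w$ and that $\tilde{R}(w_{D'}) = \{0\}$. And by Equation (A.2)

$$\begin{aligned}
C\, \partial_w R_{\mathrm{emp}}(w_D, D) + w &\subseteq C\, \partial_w R_{\mathrm{emp}}(w_D, D) - C\, \partial_w R_{\mathrm{emp}}(w_{D'}, D') + w - w_{D'} \\
&= \partial_w \tilde{R}(w) ,
\end{aligned}$$

which combined with Equation (A.1) implies $0 \in \partial_w \tilde{R}(w_D)$, so that $\tilde{R}(w)$ is minimized at $w_D$. Thus there
exists some non-positive $r \in \tilde{R}(w_D)$. Next simplify the first term of $\tilde{R}(w_D)$, scaled by $n/C$ for notational
convenience:

$$\begin{aligned}
& n \left\langle \partial_w R_{\mathrm{emp}}(w_D, D) - \partial_w R_{\mathrm{emp}}(w_{D'}, D'), w_D - w_{D'} \right\rangle \\
&= \sum_{i=1}^n \left\langle \partial_w \ell\left( y_i, f_{w_D}(x_i) \right) - \partial_w \ell\left( y_i', f_{w_{D'}}(x_i') \right), w_D - w_{D'} \right\rangle \\
&= \sum_{i=1}^{n-1} \left( \ell'\left( y_i, f_{w_D}(x_i) \right) - \ell'\left( y_i, f_{w_{D'}}(x_i) \right) \right) \left( f_{w_D}(x_i) - f_{w_{D'}}(x_i) \right) \\
&\quad + \ell'\left( y_n, f_{w_D}(x_n) \right) \left( f_{w_D}(x_n) - f_{w_{D'}}(x_n) \right) - \ell'\left( y_n', f_{w_{D'}}(x_n') \right) \left( f_{w_D}(x_n') - f_{w_{D'}}(x_n') \right) \\
&\ge \ell'\left( y_n, f_{w_D}(x_n) \right) \left( f_{w_D}(x_n) - f_{w_{D'}}(x_n) \right) - \ell'\left( y_n', f_{w_{D'}}(x_n') \right) \left( f_{w_D}(x_n') - f_{w_{D'}}(x_n') \right) ,
\end{aligned}$$

where the second equality follows from $\partial_w \ell(y, f_w(x)) = \ell'(y, f_w(x))\, \phi(x)$, where $\ell'(y, \hat{y}) = \partial_{\hat{y}} \ell(y, \hat{y})$, and
$x_i' = x_i$ and $y_i' = y_i$ for each $i \in [n-1]$. The inequality follows from the convexity of $\ell$ in its second
argument (see the footnote). Combined with the existence of non-positive $r \in \tilde{R}(w_D)$ this yields that there exists
$g \in \ell'\left( y_n, f_{w_D}(x_n) \right) \left( f_{w_D}(x_n) - f_{w_{D'}}(x_n) \right) - \ell'\left( y_n', f_{w_{D'}}(x_n') \right) \left( f_{w_D}(x_n') - f_{w_{D'}}(x_n') \right)$ such that

$$0 \ge \frac{n}{C} r \ge g + \frac{n}{2C} \|w_D - w_{D'}\|_2^2 .$$

And since $|g| \le 2L \|f_{w_D} - f_{w_{D'}}\|_\infty$ by the Lipschitz continuity of $\ell$, this in turn implies

$$\frac{n}{2C} \|w_D - w_{D'}\|_2^2 \le 2L \|f_{w_D} - f_{w_{D'}}\|_\infty . \tag{A.3}$$

Now by the reproducing property and Cauchy-Schwarz inequality we can upper bound the classifier
difference's infinity norm by the Euclidean norm on the weight vectors: for each $x$

$$\begin{aligned}
\left| f_{w_D}(x) - f_{w_{D'}}(x) \right| &= \left| \left\langle \phi(x), w_D - w_{D'} \right\rangle \right| \\
&\le \|\phi(x)\|_2 \, \|w_D - w_{D'}\|_2 \\
&= \sqrt{k(x, x)} \, \|w_D - w_{D'}\|_2 \\
&\le \kappa \, \|w_D - w_{D'}\|_2 .
\end{aligned}$$

Combining this with Inequality (A.3) yields $\|w_D - w_{D'}\|_2 \le 4LC\kappa/n$ as claimed. The $L_1$-based sensitivity
then follows from $\|w\|_1 \le \sqrt{F} \|w\|_2$ for all $w \in \mathbb{R}^F$. □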
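
The sensitivity bound can be probed empirically in the simplest setting. The sketch below assumes a one-dimensional linear feature map ($F = 1$, $\phi(x) = x$ with $|x| \le \kappa$) and hinge loss ($L = 1$), trains on random neighboring databases by ternary search over the strictly convex primal objective, and compares the change in the solution with $4LC\kappa/n$. It is an illustration on random instances, not a proof, and all names are hypothetical.

```python
import random

def hinge_objective(w, data, C):
    """Regularized empirical hinge risk for a 1-D linear classifier f_w(x) = w * x."""
    n = len(data)
    return 0.5 * w * w + (C / n) * sum(max(0.0, 1.0 - y * w * x) for x, y in data)

def minimize_convex(f, lo=-50.0, hi=50.0, iters=300):
    """Ternary search for the minimizer of a strictly convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def sensitivity_gap(n, C, kappa, rng):
    """Return (|w_D - w_D'|, 4 * L * C * kappa / n) for one random neighboring pair."""
    data = [(rng.uniform(-kappa, kappa), rng.choice([-1, 1])) for _ in range(n)]
    neighbor = data[:-1] + [(rng.uniform(-kappa, kappa), rng.choice([-1, 1]))]
    w_d = minimize_convex(lambda w: hinge_objective(w, data, C))
    w_n = minimize_convex(lambda w: hinge_objective(w, neighbor, C))
    return abs(w_d - w_n), 4.0 * 1.0 * C * kappa / n  # L = 1 for hinge loss

rng = random.Random(0)
for _ in range(5):
    print(sensitivity_gap(n=20, C=5.0, kappa=2.0, rng=rng))
```

The search interval is wide enough here because the minimizer always satisfies $|w^*| \le C\kappa$.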



Next we move to proofs of utility. Lemma 22 mirrors Lemma 12, generalizing the result to
non-differentiable convex loss functions.

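Lemma 22. Let $\mathcal{H}$ be an RKHS with translation-invariant kernel $k$, and let $\hat{\mathcal{H}}$ be the random RKHS
corresponding to feature map (4.1) induced by $k$. Let $C$ be a positive scalar and loss $\ell(y, \hat{y})$ be convex and
$L$-Lipschitz continuous in $\hat{y}$. Consider the regularized empirical risk minimizers in each RKHS

$$f^* \in \arg\min_{f \in \mathcal{H}} \frac{C}{n} \sum_{i=1}^n \ell(y_i, f(x_i)) + \frac{1}{2} \|f\|_{\mathcal{H}}^2 ,$$

$$g^* \in \arg\min_{g \in \hat{\mathcal{H}}} \frac{C}{n} \sum_{i=1}^n \ell(y_i, g(x_i)) + \frac{1}{2} \|g\|_{\hat{\mathcal{H}}}^2 .$$

Let $\mathcal{M} \subseteq \mathbb{R}^d$ be any set containing $x_1, \ldots, x_n$. For any $\epsilon > 0$, if the dual variables from both optimizations
have $L_1$-norms bounded by some $\Lambda > 0$ and

$$\left\| k - \hat{k} \right\|_{\infty; \mathcal{M}} \le \min\left\{ 1, \frac{\epsilon^2}{2^2 \left( \Lambda + 2\sqrt{(CL + \Lambda/2)\Lambda} \right)^2} \right\}$$

then $\|f^* - g^*\|_{\infty; \mathcal{M}} \le \epsilon/2$.

Footnote (proof of Lemma 21). Namely, for convex $f$ and any $a, b \in \mathbb{R}$, $(g_a - g_b)(a - b) \ge 0$ for all $g_a \in \partial f(a)$ and all $g_b \in \partial f(b)$.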
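
Proof. Denote the empirical risk functional $R_{\mathrm{emp}}[f] = n^{-1} \sum_{i=1}^n \ell(y_i, f(x_i))$ and the regularized empirical
risk functional $R_{\mathrm{reg}}[f] = C R_{\mathrm{emp}}[f] + \|f\|^2/2$, for the appropriate RKHS norm (either $\mathcal{H}$ or $\hat{\mathcal{H}}$). Let $f^*$
denote the regularized empirical risk minimizer in $\mathcal{H}$, given by parameter vector $\alpha^*$, and let $g^*$ denote the
regularized empirical risk minimizer in $\hat{\mathcal{H}}$, given by parameter vector $\beta^*$. Let $g_{\alpha^*} = \sum_{i=1}^n \alpha_i^* y_i \hat{\phi}(x_i)$ and
$f_{\beta^*} = \sum_{i=1}^n \beta_i^* y_i \phi(x_i)$ denote the images of $f^*$ and $g^*$ under the natural mapping between the spans of the
data in RKHS's $\hat{\mathcal{H}}$ and $\mathcal{H}$ respectively. We will first show that these four functions have arbitrarily close
regularized empirical risk in their respective RKHS, and then that this implies uniform proximity of the
functions themselves. First observe that for any $g \in \hat{\mathcal{H}}$

$$\begin{aligned}
R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g] &= C R_{\mathrm{emp}}[g] + \frac{1}{2} \|g\|_{\hat{\mathcal{H}}}^2 \\
&\ge C \left\langle \partial_g R_{\mathrm{emp}}[g^*], g - g^* \right\rangle_{\hat{\mathcal{H}}} + C R_{\mathrm{emp}}[g^*] + \frac{1}{2} \|g\|_{\hat{\mathcal{H}}}^2 \\
&= \left\langle \partial_g R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g^*], g - g^* \right\rangle_{\hat{\mathcal{H}}} - \left\langle g^*, g - g^* \right\rangle_{\hat{\mathcal{H}}} + C R_{\mathrm{emp}}[g^*] + \frac{1}{2} \|g\|_{\hat{\mathcal{H}}}^2 .
\end{aligned}$$

The inequality follows from the convexity of $R_{\mathrm{emp}}[\cdot]$ and holds for all elements of the subdifferential $\partial_g R_{\mathrm{emp}}[g^*]$.
The subsequent equality holds by $\partial_g R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g] = C\, \partial_g R_{\mathrm{emp}}[g] + g$. Now since $0 \in \partial_g R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g^*]$, it follows that

$$\begin{aligned}
R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g] &\ge C R_{\mathrm{emp}}[g^*] + \frac{1}{2} \|g\|_{\hat{\mathcal{H}}}^2 - \left\langle g^*, g - g^* \right\rangle_{\hat{\mathcal{H}}} \\
&= C R_{\mathrm{emp}}[g^*] + \frac{1}{2} \|g^*\|_{\hat{\mathcal{H}}}^2 + \frac{1}{2} \|g\|_{\hat{\mathcal{H}}}^2 - \frac{1}{2} \|g^*\|_{\hat{\mathcal{H}}}^2 - \left\langle g^*, g - g^* \right\rangle_{\hat{\mathcal{H}}} \\
&= R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g^*] + \frac{1}{2} \|g\|_{\hat{\mathcal{H}}}^2 - \left\langle g^*, g \right\rangle_{\hat{\mathcal{H}}} + \frac{1}{2} \|g^*\|_{\hat{\mathcal{H}}}^2 \\
&= R^{\hat{\mathcal{H}}}_{\mathrm{reg}}[g^*] + \frac{1}{2} \|g - g^*\|_{\hat{\mathcal{H}}}^2 .
\end{aligned}$$

The remainder of Lemma 12's proof remains the same, as it does not depend on the loss's differentiability. □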



